Preface


While significant changes have been made in the current edition from its predecessor, the
authors have tried to keep the discussion at the same level of accessibility, that is, less
mathematical than the measure theory approach but more rigorous than formula and recipe
manuals.

It has been said that probability is hard to understand, not so much because of its 
mathematical underpinnings but because it produces many results that are counterintuitive.
Among practically oriented students, probability has many critics. Foremost among these are
the ones who ask, “What do we need it for?” This criticism is easy to answer because future 
engineers and scientists will come to realize that almost every human endeavor involves 
making decisions in an uncertain or probabilistic environment. This is true for entire fields 
such as insurance, meteorology, urban planning, pharmaceuticals, and many more. Another, 
possibly more potent, criticism is, “What good is probability if the answers it furnishes are 
not certainties but just inferences and likelihoods?” The answer here is that an immense 
amount of good planning and accurate predictions can be done even in the realm of uncer- 
tainty. Moreover, applied probability—often called statistics—does provide near certainties: 
witness the enormous success of political polling and prediction. 

In previous editions, we have trod lightly in the area of statistics and more heavily
in the area of random processes and signal processing. In the electronic version of this book, 
graduate-level signal processing and advanced discussions of random processes are retained, 
along with new material on statistics. In the hard copy version of the book, we have dropped 
the chapters on applications to statistical signal processing and advanced topics in random 
processes, as well as some introductory material on pattern recognition. 

The present edition makes a greater effort to reach students with more expository 
examples and more detailed discussion. We have minimized the use of phrases such as
“it is easy to show...”, “it can be shown...”, “it is easy to see...”, and the like. Also,
we have tried to furnish examples from real-world issues such as the efficacy of drugs, 
the likelihood of contagion, and the odds of winning at gambling, as well as from digital 
communications, networks, and signals. 

The other major change is the addition of two chapters on elementary statistics and its 
applications to real-world problems. The first of these deals with parameter estimation and 
the second with hypothesis testing. Many activities in engineering involve estimating para- 
meters, for example, from estimating the strength of a new concrete formula to estimating 
the amount of signal traffic between computers. Likewise many engineering activities involve 
making decisions in random environments, from deciding whether new drugs are effective to 
deciding the effectiveness of new teaching methods. The origin and applications of standard 
statistical tools such as the t-test, the Chi-square test, and the F-test are presented and 
discussed with detailed examples and end-of-chapter problems. 

Finally, many self-test multiple-choice exams are now available for students at the book 
Web site. These exams were administered to senior undergraduate and graduate students 
at the Illinois Institute of Technology during the tenure of one of the authors who taught 
there from 1988 to 2006. The Web site also includes an extensive set of small MATLAB 
programs that illustrate the concepts of probability. 

In summary then, readers familiar with the 3rd edition will see the following significant
changes: 


• A new chapter on a branch of statistics called parameter estimation with many
illustrative examples;

• A new chapter on a branch of statistics called hypothesis testing with many illustrative
examples;

• A large number of new homework problems of varying degrees of difficulty to test the
student’s mastery of the principles of statistics;

• A large number of self-test, multiple-choice exam questions calibrated to the material
in various chapters, available on the Companion Web site;

• Many additional illustrative examples drawn from real-world situations where the
principles of probability and statistics have useful applications;

• A greater involvement of computers as teaching/learning aids such as (i) graphical
displays of probabilistic phenomena; (ii) MATLAB programs to illustrate probabilistic
concepts; (iii) homework problems requiring the use of MATLAB/Excel to realize
probability and statistical theory;

• Numerous revised discussions, based on student feedback, meant to facilitate the
understanding of difficult concepts.


Henry Stark, IIT
Professor Emeritus 


John W. Woods, Rensselaer 
Professor 


1 Introduction to Probability


1.1 INTRODUCTION: WHY STUDY PROBABILITY? 


One of the most frequent questions posed by beginning students of probability is, “Is 
anything truly random and if so how does one differentiate between the truly random 
and that which, because of a lack of information, is treated as random but really isn’t?” 
First, regarding the question of truly random phenomena, “Do such things exist?” As we 
look with telescopes out into the universe, we see vast arrays of galaxies, stars, and planets 
in apparently random order and position. 

At the other extreme from the cosmic scale is what happens at the atomic level. Our 
friends the physicists speak of such things as the probability of an atomic system being in 
a certain state. The uncertainty principle says that, try as we might, there is a limit to 
the accuracy with which the position and momentum can be simultaneously ascribed to a 
particle. Both quantities are fuzzy and indeterminate. 

Many, including some of our most famous physicists, believe in an essential random- 
ness of nature. Eugen Merzbacher in his well-known textbook on quantum mechanics [1-1] 
writes, 


The probability doctrine of quantum mechanics asserts that the indetermination, of 
which we have just given an example, is a property inherent in nature and not merely a 
profession of our temporary ignorance from which we expect to be relieved by a future 
better and more complete theory. The conventional interpretation thus denies the 
possibility of an ideal theory which would encompass the present quantum mechanics 
but would be free of its supposed defects, the most notorious “imperfection” of quantum
mechanics being the abandonment of strict classical determinism. 


But the issue of determinism versus inherent indeterminism need never even be consid- 
ered when discussing the validity of the probabilistic approach. The fact remains that there 
is, quite literally, a nearly uncountable number of situations where we cannot make any 
categorical deterministic assertion regarding a phenomenon because we cannot measure all 
the contributing elements. Take, for example, predicting the value of the noise current i(t) 
produced by a thermally excited resistor R. Conceivably, we might accurately predict i(t) 
at some instant t in the future if we could keep track, say, of the $10^{23}$ or so excited electrons
moving in each other’s magnetic fields and setting up local field pulses that eventually all 
contribute to producing i(t). Such a calculation is quite inconceivable, however, and there- 
fore we use a probabilistic model rather than Maxwell’s equations to deal with resistor noise. 
Similar arguments can be made for predicting the weather, the outcome of tossing a real 
physical coin, the time to failure of a computer, dark current in a CMOS imager, and many 
other situations. Thus, we conclude: Regardless of which position one takes, that is, deter- 
minism versus indeterminism, we are forced to use probabilistic models in the real world 
because we do not know, cannot calculate, or cannot measure all the forces contributing to 
an effect. The forces may be too complicated, too numerous, or too faint. 

Probability is a mathematical model to help us study physical systems in an average 
sense. We have to be able to repeat the experiment many times under the same conditions. 
Probability then tells us how often to expect the various outcomes. Thus, we cannot use 
probability in any meaningful sense to answer questions such as “What is the probability 
that a comet will strike the earth tomorrow?” or “What is the probability that there is life 
on other planets?” The problem here is that we have no data from similar “experiments” 
in the past. 

R. A. Fisher and R. Von Mises, in the first third of the twentieth century, were 
largely responsible for developing the groundwork of modern probability theory. The modern 
axiomatic treatment upon which this book is based is largely the result of the work by Andrei 
N. Kolmogorov [1-2]. 


1.2 THE DIFFERENT KINDS OF PROBABILITY 


There are essentially four kinds of probability. We briefly discuss them here. 


Probability as Intuition 


This kind of probability deals with judgments based on intuition. Thus, “She will probably 
marry him” and “He probably drove too fast” are in this category. Intuitive probability 
can lead to contradictory behavior. Joe is still likely to buy an imported Itsibitsi, world 
famous for its reliability, even though his neighbor Frank has a 19-year-old Buick that has 
never broken down and Joe’s other neighbor, Bill, has his Itsibitsi in the repair shop. Here 
Joe may be behaving “rationally,” going by the statistics and ignoring, so to speak, his
personal observation. On the other hand, Joe will be wary about letting his nine-year-old
daughter Jane swim in the local pond if Frank reports that Bill thought that he might
have seen an alligator in it. This despite the fact that no one has ever reported seeing
an alligator in this pond, and countless people have enjoyed swimming in it without ever
having been bitten by an alligator. To give this example some credibility, assume that the
pond is in Florida. Here Joe is ignoring the statistics and reacting to what is essentially
a rumor. Why? Possibly because the cost to Joe, “just in case” there is an alligator in the
pond, would be too high [1-3].

People buying lottery tickets intuitively believe that certain number combinations, like
the month/day/year of their grandson’s birthday, are more likely to win than, say, 06-06-06.
How many people will bet even odds that a coin that has heretofore behaved “fairly,” that
is, in an unbiased fashion, will come up heads on the next toss if it has come up heads in
each of the last seven tosses? Many of us share the belief that the coin has some sort of
memory and that, after seven heads, the coin must “make things right” by coming up with more tails.

A mathematical theory dealing with intuitive probability was developed by 
B. O. Koopman [1-4]. However, we shall not discuss this subject in this book. 


Probability as the Ratio of Favorable to Total Outcomes 
(Classical Theory) 


In this approach, which is not experimental, the probability of an event E is computed a priori†
by counting the number of ways $n_E$ that E can occur and forming the ratio $n_E/n$, where
$n$ is the number of all possible outcomes, that is, the number of all alternatives to E plus
$n_E$. An important notion here is that all outcomes are equally likely. Since equally likely
is really a way of saying equally probable, the reasoning is somewhat circular. Suppose we
throw a pair of unbiased six-sided dice‡ and ask what is the probability of getting a 7. We
partition the outcome space into 36 equally likely outcomes as shown in Table 1.2-1, where
each entry is the sum of the numbers on the two dice.


Table 1.2-1 Outcomes of Throwing Two Dice

                     1st die
2nd die |   1    2    3    4    5    6
--------+------------------------------
   1    |   2    3    4    5    6    7
   2    |   3    4    5    6    7    8
   3    |   4    5    6    7    8    9
   4    |   5    6    7    8    9   10
   5    |   6    7    8    9   10   11
   6    |   7    8    9   10   11   12


†A priori means relating to reasoning from self-evident propositions or prior experience. The related
phrase, a posteriori, means relating to reasoning from observed facts.
‡We will always assume that our dice have six sides.
The total number of outcomes is 36 if we keep the dice distinct. The number of ways
of getting a 7 is $n_7 = 6$. Hence

$$P[\text{getting a 7}] = \frac{6}{36} = \frac{1}{6}.$$

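The counting argument for the pair of dice can be checked by direct enumeration. The following is a minimal Python sketch, not part of the book's MATLAB material; the helper name `classical_probability` is an illustrative assumption. It lists the 36 ordered outcomes and forms the ratio of favorable to total outcomes, which is valid only under the equally likely assumption.

```python
from fractions import Fraction
from itertools import product


def classical_probability(outcomes, is_favorable):
    """Return n_E / n, assuming every outcome in `outcomes` is equally likely."""
    outcomes = list(outcomes)
    n_E = sum(1 for outcome in outcomes if is_favorable(outcome))
    return Fraction(n_E, len(outcomes))


# Ordered pairs (first die, second die): the dice are kept distinct.
dice = list(product(range(1, 7), repeat=2))

p_seven = classical_probability(dice, lambda pair: pair[0] + pair[1] == 7)
print(len(dice))  # 36
print(p_seven)    # 1/6
```

Using exact fractions rather than floating point keeps the result in the same form as the hand calculation.
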
Example 1.2-1 
(toss a fair coin twice) The possible outcomes are HH, HT, TH, and TT. The probability 
of getting at least one tail T is computed as follows: With E denoting the event of getting
at least one tail, the event E is the set of outcomes

$$E = \{HT, TH, TT\}.$$

Thus, event E occurs whenever the outcome is HT or TH or TT. The number of elements
in E is $n_E = 3$; the number of all outcomes, $n$, is four. Hence

$$P[\text{at least one T}] = \frac{n_E}{n} = \frac{3}{4}.$$

Note that since no physical experimentation is involved, there is no problem in postulating 
an ideal “fair coin.” Effectively, in classical probability every experiment is considered 
“fair.” 


The classical theory suffers from at least two significant problems: (1) It cannot deal 
with outcomes that are not equally likely; and (2) it cannot handle an infinite number 
of outcomes, that is, when $n = \infty$. Nevertheless, in those problems where it is impractical
to actually determine the outcome probabilities by experimentation and where, because of 
symmetry considerations, one can indeed argue equally likely outcomes, the classical theory 
is useful. 

Historically, the classical approach was the predecessor of Richard Von Mises’ [1-6] 
relative frequency approach developed in the 1930s, which we consider next. 


Probability as a Measure of Frequency of Occurrence 


The relative frequency approach to defining the probability of an event E is to perform
an experiment $n$ times. The number of times that E appears is denoted by $n_E$. Then it is
tempting to define the probability of E occurring by

$$P[E] = \lim_{n \to \infty} \frac{n_E}{n}. \tag{1.2-1}$$

Quite clearly, since $n_E \le n$ we must have $0 \le P[E] \le 1$. One difficulty with this approach
is that we can never perform the experiment an infinite number of times, so we can only
estimate $P[E]$ from a finite number of trials. Secondly, we postulate that $n_E/n$ approaches
a limit as $n$ goes to infinity. But consider flipping a fair coin 1000 times. The likelihood
of getting exactly 500 heads is very small; in fact, if we flipped the coin 10,000 times, the
likelihood of getting exactly 5000 heads is even smaller. As $n \to \infty$, the probability of observing
exactly $n/2$ heads becomes vanishingly small. Yet our intuition demands that $P[\text{head}] = 1/2$
for a fair coin. Suppose we choose a $\delta > 0$; then we shall find experimentally that if the coin
is truly fair, the number of times that

$$\left| \frac{n_E}{n} - \frac{1}{2} \right| > \delta, \tag{1.2-2}$$

as $n$ becomes large, becomes very small. Thus, although it is very unlikely that at any stage
of this experiment, especially when $n$ is large, $n_E/n$ is exactly 1/2, this ratio will nevertheless
hover around 1/2, and the number of times it will make a significant excursion away from the
vicinity of 1/2, according to Equation 1.2-2, becomes very small indeed.

Despite these problems with the relative frequency definition of probability, the relative
frequency concept is essential in applying probability theory to the physical world.
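Both claims about the fair coin can be made concrete with exact binomial counts. The sketch below is an illustration in Python, not the book's own code; the function names and the choice $\delta = 0.05$ are assumptions made here. It computes the probability of exactly $n/2$ heads and the probability that the relative frequency differs from 1/2 by more than $\delta$.

```python
from math import comb


def prob_exactly_half(n):
    """P[exactly n/2 heads in n flips of a fair coin], for even n."""
    return comb(n, n // 2) / 2 ** n


def prob_far_from_half(n, delta):
    """P[|n_E/n - 1/2| > delta] for n flips of a fair coin (exact binomial sum)."""
    favorable = 0
    coefficient = 1  # C(n, 0)
    for k in range(n + 1):
        # |k/n - 1/2| > delta  is equivalent to  |2k - n| > 2*delta*n
        if abs(2 * k - n) > 2 * delta * n:
            favorable += coefficient
        coefficient = coefficient * (n - k) // (k + 1)  # C(n, k+1)
    return favorable / 2 ** n


print(prob_exactly_half(1000))    # roughly 0.0252
print(prob_exactly_half(10_000))  # roughly 0.0080

for n in (100, 1000, 10_000):
    print(n, prob_far_from_half(n, delta=0.05))
```

The first pair of numbers shrinks as $n$ grows, while the second set shows that large excursions from 1/2 become rarer as well, which is the behavior the relative frequency argument relies on.
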


Example 1.2-2 
(random.org) An Internet source of random numbers is RANDOM.ORG, which was founded 
by a professor in the School of Computer Science and Statistics at Trinity College, Dublin, 
Ireland. It calculates random digits as a function of atmospheric noise and has passed 
many statistical tests for true randomness. Using one of the site’s free services, we have 
downloaded 10,000 random numbers, each taking on values from 1 to 100 equally likely. We 
have written the MATLAB function RelativeFrequencies() that takes this file of random 
numbers and plots the ratio $n_E/n$ as a function of the trial number $n = 1, \ldots, 10{,}000$. We
can choose the event E to be the occurrence of any one of the 100 numbers. For example, for
$E \triangleq \{\text{occurrence of number 5}\}$, the number $n_E$ counts the number of times 5 has occurred
among the first $n$ of the 10,000 numbers. A resulting output plot is shown in Figure 1.2-1,
where we see a general tendency toward convergence to the ideal value of 0.01 = 1/100
for 100 equally likely numbers. An output plot for another number choice 23 is shown in 
Figure 1.2-2 again showing a general tendency to converge to the ideal value here of 0.01. In 
both cases though, we note that the convergence is not exact at any value of n, but rather 
just convergence to a small neighborhood of the ideal value. 
This program is available at this book’s website. 
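The MATLAB function RelativeFrequencies() is not reproduced here. As a rough stand-in, the following Python sketch computes the same running ratio of occurrences to trials. Two assumptions should be noted: the numbers come from Python's pseudo-random generator with an arbitrary seed rather than from the RANDOM.ORG atmospheric-noise file, and the function name `running_relative_frequency` is made up for this illustration.

```python
import random


def running_relative_frequency(samples, target):
    """Return the list of n_E/n for n = 1..len(samples), with E = {sample == target}."""
    frequencies = []
    n_E = 0
    for n, value in enumerate(samples, start=1):
        if value == target:
            n_E += 1
        frequencies.append(n_E / n)
    return frequencies


rng = random.Random(0)  # assumed seed; pseudo-random, not atmospheric noise
samples = [rng.randint(1, 100) for _ in range(10_000)]

freqs = running_relative_frequency(samples, target=5)
print(freqs[99], freqs[999], freqs[-1])  # ratio after 100, 1000, and 10,000 trials
```

Plotting `freqs` against the trial index should give a curve of the same general character as the figures described in the example: early values fluctuate widely, and later values stay in a small neighborhood of 0.01 without settling on it exactly.
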


Probability Based on an Axiomatic Theory 


The axiomatic approach is followed in most modern textbooks on the subject. To develop it 
we must introduce certain ideas, especially those of a random experiment, a sample space, 
and an event. Briefly stated, a random experiment is simply an experiment in which the 
outcomes are nondeterministic, that is, more than one outcome can occur each time the 
experiment is run. Hence the word random in random experiment. The sample space is the 
set of all outcomes of the random experiment. An event is a subset of the sample space that 
satisfies certain constraints. For example, we want to be able to calculate the probability for 
each event. Also in the case of noncountable or continuous sample spaces, there are certain 
technical restrictions on what subsets can be called events. An event with only one outcome 
will be called a singleton or elementary event. These notions will be made more precise in 
Sections 1.4 and 1.5. 
Figure 1.2-1 Plot of $n_E/n$ for E = {occurrence of number 5} versus $n$ (horizontal axis from 1 to 10,000) from
atmospheric noise (from website RANDOM.ORG).

Figure 1.2-2 Plot of $n_E/n$ for E = {occurrence of number 23} versus $n$ (horizontal axis from 1 to 10,000) from
atmospheric noise (from website RANDOM.ORG).

1.3 MISUSES, MISCALCULATIONS, AND PARADOXES IN PROBABILITY


The misuse of probability and statistics in everyday life is quite common. Many of the 
misuses are illustrated by the following examples. Consider a defendant in a murder trial 
who pleads not guilty to murdering his wife. The defendant has on numerous occasions 
beaten his wife. His lawyer argues that, yes, the defendant has beaten his wife but that 
among men who do so, the probability that one of them will actually murder his wife is 
only 0.001, that is, only one in a thousand. Let us assume that this statement is true. It 
is meant to sway the jury by implying that the fact of beating one’s wife is no indicator 
of murdering one’s wife. Unfortunately, unless the members of the jury have taken a good 
course in probability, they might not be aware that a far more significant question is the 
following: Given that a battered wife is murdered, what is the probability that the husband is 
the murderer? Statistics show that this probability is, in fact, greater than one-half. 

In the 1996 presidential race, Senator Bob Dole’s age became an issue. His opponents 
claimed that a 72-year-old white male has a 27 percent risk of dying in the next five years. 
Thus it was argued, were Bob Dole elected, the probability that he would fail to survive his 
term was greater than one-in-four. The trouble with this argument is that the probability 
of survival, as computed, was not conditioned on additional pertinent facts. As it happens, 
if a 72-year-old male is still in the workforce and, additionally, happens to be rich, then 
taking these additional facts into consideration, the average 73-year-old (the age at which 
Dole would have assumed the presidency) has only a one-in-eight chance of dying in the 
next four years [1-3].

Misuse of probability appears frequently in predicting life elsewhere in the universe. 
In his book Probability 1 (Harcourt Brace & Company, 1998), Amir Aczel assures us 
that we can be certain that alien life forms are out there just waiting to be discovered. 
However, in a cogent review of Aczel’s book, John Durant of London’s Imperial College 
writes, 


Statistics are extremely powerful and important, and Aczel is a very clear and capable 
exponent of them. But statistics cannot substitute for empirical knowledge about the
way the universe behaves. We now have no plausible way of arriving at
robust estimates for the probability of life arriving spontaneously when the conditions
are right. So, until we either discover extraterrestrial life or understand far more about 
how at least one form of life—terrestrial life—first appeared, we can do little more 
than guess at the likelihood that life exists elsewhere in the universe. And as long as 
we're guessing, we should not dress up our interesting speculations as mathematical 
certainties. 


The computation of probabilities based on relative frequency can lead to paradoxes. An 
excellent example is found in [1-3]. We repeat the example here: 


In a sample of American women between the ages of 35 and 50, 4 out of 100 develop 
breast cancer within a year. Does Mrs. Smith, a 49-year-old American woman, there- 
fore have a 4% chance of getting breast cancer in the next year? There is no answer. 
Suppose that in a sample of women between the ages of 45 and 90—a class to which 
Mrs. Smith also belongs—11 out of 100 develop breast cancer in a year. Are
Mrs. Smith’s chances 4%, or are they 11%? Suppose that her mother had breast cancer, 
and 22 out of 100 women between 45 and 90 whose mothers had the disease will develop 
it. Are her chances 4%, 11%, or 22%? She also smokes, lives in California, had two 
children before the age of 25 and one after 40, is of Greek descent .... What group 
should we compare her with to figure out the “true” odds? You might think, the more 
specific the class, the better—but the more specific the class, the smaller its size and 
the less reliable the frequency. If there were only two people in the world very much 
like Mrs. Smith, and one developed breast cancer, would anyone say that Mrs. Smith’s 
chances are 50%? In the limit, the only class that is truly comparable with Mrs. Smith 
in all her details is the class containing Mrs. Smith herself. But in a class of one 
“relative frequency” makes no sense. 


The previous example should not leave the impression that the study of probability, 
based on relative frequency, is useless. For one, there are a huge number of engineering and 
scientific situations that are not nearly as complex as the case of Mrs. Smith’s likelihood of 
getting cancer. Also, it is true that if we refine the class and thereby reduce the class size, 
our estimate of probability based on relative frequency becomes less stable. But exactly 
how much less stable is deep within the realm of the study of probability and its offspring 
statistics (e.g., see the Law of Large Numbers in Section 4.4). Also, there are many situations 
where the required conditioning, that is, class refinement, is such that the class size is 
sufficiently large for excellent estimates of probability. And finally returning to Mrs. Smith, 
if the class size starts to get too small, then stop adding conditions and learn to live with 
a probability estimate associated with a larger, less refined class. This estimate may be 
sufficient for all kinds of actions, for example, planning screening tests and the like.


1.4 SETS, FIELDS, AND EVENTS 


A set is a collection of objects, either concrete or abstract. An example of a set is the set of
all New York residents whose height equals or exceeds 6 feet. A subset of a set is a collection
that is contained within the larger set. Thus, the set of all New York City residents whose
height is between 6 and 6½ feet is a subset of the previous set. In probability theory we call
sets events. We are particularly interested in the set of all outcomes of a random experiment
and subsets of this set. We denote the set of all outcomes by $\Omega$, and individual outcomes
by $\zeta$.† The set $\Omega$ is called the sample space of the random experiment. Certain subsets of
$\Omega$, whose probabilities we are interested in, are called events. In particular, $\Omega$ itself is called
the certain event and the empty set $\phi$ is called the null event.


Examples of Sample Spaces 


Example 1.4-1 
(coin flip) The experiment consists of flipping a coin once. Then $\Omega = \{H, T\}$, where H is a
head and T is a tail.


†Greek letter $\zeta$ is pronounced zeta.
Example 1.4-2
(coin flip twice) The experiment consists of flipping a coin twice. Then
$\Omega = \{HH, HT, TH, TT\}$. One of sixteen subsets of $\Omega$ is $E = \{HH, HT, TH\}$; it is the event
of getting at least one head in two flips.


Example 1.4-3 
(hair on head) The experiment consists of choosing a person at random and counting the 
hairs on his or her head. Then 


$$\Omega = \{0, 1, 2, \ldots, 10^7\},$$


that is, the set of all nonnegative integers up to $10^7$, it being assumed that no human head
has more than $10^7$ hairs.


Example 1.4-4 
(couple’s ages) The experiment consists of determining the age to the nearest year of each 
member of a married couple chosen at random. Then with $x$ denoting the age of the man
and $y$ denoting the age of the woman, $\Omega$ is described by


$$\Omega = \{\text{2-tuples } (x, y) : x \text{ any integer in 10--200};\ y \text{ any integer in 10--200}\}.$$


Note that in Example 1.4-4 we have assumed that no human lives beyond 200 years and that 
no married person is ever less than ten years old. Similarly, in Example 1.4-1, we assumed 
that the coin never lands on edge. If the latter is a possible outcome, it must be included 
in $\Omega$ in order for it to denote the set of all outcomes as well as the certain event.


Example 1.4-5 
(angle in elastic collision) The experiment consists of observing the angle of deflection of a 
nuclear particle in an elastic collision. Then 


$$\Omega = \{\theta : -\pi \le \theta \le \pi\}.$$


An example of an event or subset of $\Omega$ is


Example 1.4-6 
(electrical power) The experiment consists of measuring the instantaneous power P consumed 
by a current-driven resistor. Then 


$$\Omega = \{P : P \ge 0\}.$$


Since power cannot be negative, we leave out negative values of $P$ in $\Omega$. A subset of $\Omega$ is
the event $E = \{P > 10^{-3} \text{ watts}\}$.
Note that in Examples 1.4-5 and 1.4-6, the number of elements in $\Omega$ is uncountably infinite.
Therefore, there are an uncountably infinite number of subsets. When, as in Example 1.4-4,
the number of outcomes is finite, the number of distinct subsets is also finite, and each
represents an event. Thus, if $\Omega = \{\zeta_1, \ldots, \zeta_n\}$, the number of possible subsets of $\Omega$ is $2^n$.
We can see this by noting that each element $\zeta_i$ either is or is not present in any given subset
of $\Omega$. This gives rise to $2^n$ distinct subsets or events, including the certain event and the
impossible or null event.
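The count of subsets of a finite sample space can be confirmed by listing them. The Python sketch below is illustrative only, and the helper name `all_subsets` is an assumption. It enumerates every subset of the two-flip sample space of Example 1.4-2 and checks that there are sixteen of them.

```python
from itertools import combinations


def all_subsets(omega):
    """Return every subset of the finite set `omega` as a list of frozensets."""
    items = sorted(omega)
    return [
        frozenset(choice)
        for size in range(len(items) + 1)
        for choice in combinations(items, size)
    ]


omega = {"HH", "HT", "TH", "TT"}
events = all_subsets(omega)

print(len(events))                   # 16
print(frozenset() in events)         # True: the null event
print(frozenset(omega) in events)    # True: the certain event
```

Grouping the subsets by size mirrors the in-or-out argument in the text: each of the four outcomes is either present or absent, giving 2 × 2 × 2 × 2 possibilities.
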


Review of set theory. The union (sum) of two sets E and F, written $E \cup F$ or $E + F$, is
the set of all elements that are in at least one of the sets E and F. Thus, with $E = \{1, 2, 3, 4\}$
and $F = \{1, 3, 4, 5, 6\}$,†

$$E \cup F = \{1, 2, 3, 4, 5, 6\}.$$

If E is a subset of F, we indicate this by writing $E \subset F$. Clearly for $E \subset F$ it follows that
$E \cup F = F$. We indicate that $\zeta$ is an element of $\Omega$ or “belongs” to $\Omega$ by writing $\zeta \in \Omega$.
Thus, we can write

$$E \cup F = \{\zeta : \zeta \in E \text{ or } \zeta \in F\}, \tag{1.4-1}$$

where the “or” here is inclusive. Clearly $E \cup F = F \cup E$. The intersection or set product
of two sets E and F, written $E \cap F$ or just $EF$, is the set of elements common to both E
and F. Thus, in the preceding example

$$EF = \{1, 3, 4\}.$$

Formally, $EF \triangleq \{\zeta : \zeta \in E \text{ and } \zeta \in F\} = FE$. The complement of a set E, written $E^c$, is
the set of all elements not in E. From this it follows that if $\Omega$ is the sample space or, more
generally, the universal set, then

$$E \cup E^c = \Omega. \tag{1.4-2}$$

Also $EE^c = \phi$. The set difference of two sets or, more appropriately, the reduction of E by
F, written $E - F$, is the set made up of elements in E that are not in F. It should be clear
that

$$E - F \triangleq EF^c,$$
$$F - E \triangleq FE^c,$$

but be careful. Set difference does not behave like difference of numbers; for example,
$F - E - E = F - E$. The exclusive or of two sets, written $E \oplus F$, is the set of all elements
in E or F but not both. It is readily shown that‡

$$E \oplus F = (E - F) \cup (F - E). \tag{1.4-3}$$


†Remember, the order of the elements in a set is not important.

‡Equation 1.4-3 shows why $\cup$ is preferable to $+$ to indicate union. The beginning student might, in
error, write $(E - F) + (F - E) = E - F + F - E = 0$, which is meaningless. Note also that $F + F \ne 2F$,
which is also meaningless. In fact $F + F = F$. So, only use $+$ and $-$ operators in set theory with care.
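
The set operations reviewed above map directly onto Python's built-in set type. The sketch below uses the sets E = {1, 2, 3, 4} and F = {1, 3, 4, 5, 6} from the text; the universal set {1, ..., 6} is an assumption introduced here so that complements are defined.

```python
# Universal set: an assumption for this illustration, needed to form complements.
omega = {1, 2, 3, 4, 5, 6}
E = {1, 2, 3, 4}
F = {1, 3, 4, 5, 6}

union = E | F            # E ∪ F
intersection = E & F     # EF
E_complement = omega - E # E^c relative to omega
F_complement = omega - F # F^c relative to omega
E_minus_F = E - F        # elements of E not in F
F_minus_E = F - E        # elements of F not in E
exclusive_or = E ^ F     # E ⊕ F (symmetric difference)

# Checks corresponding to the relations in the text.
assert union == {1, 2, 3, 4, 5, 6}
assert intersection == {1, 3, 4}
assert E | E_complement == omega               # Equation 1.4-2
assert E & E_complement == set()               # E E^c is the null set
assert E_minus_F == E & F_complement           # E - F = E F^c
assert F_minus_E == F & E_complement           # F - E = F E^c
assert (F - E) - E == F - E                    # not like numeric subtraction
assert exclusive_or == E_minus_F | F_minus_E   # Equation 1.4-3
assert F | F == F                              # F + F = F, not "2F"
print(E_minus_F, F_minus_E, exclusive_or)
```

Python has no complement operator because a set does not know its universal set, which is why the complement is written as a difference from `omega`.
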

Figure 1.4-1 Venn diagrams for set operations: (a) $E \cup F$; (b) $EF$; (c) $E^c$; (d) $E - F$; (e) $F - E$; (f) $E \oplus F$.


The operation of unions, intersections, and so forth can be illustrated by Venn diagrams, 
which are useful as aids in reasoning and in establishing probability relations. The various 
set operations $E \cup F$, $EF$, $E^c$, $E - F$, $F - E$, $E \oplus F$ are shown in Figure 1.4-1 in hatch
lines.

Two sets E, F are said to be disjoint if $EF = \phi$; that is, they have no elements in
common. Given any set E, an n-partition of E consists of a sequence of sets $E_i$, where
$i = 1, \ldots, n$, such that $E_i \subset E$, $\bigcup_{i=1}^{n} E_i = E$, and $E_i E_j = \phi$ for all $i \ne j$. Thus, given two
sets E, F, a 2-partition of F is

$$F = FE \cup FE^c. \tag{1.4-4}$$

It is easy to see, using Venn diagrams, the following results:

$$(E \cup F)^c = E^c F^c \tag{1.4-5}$$
$$(EF)^c = E^c \cup F^c \tag{1.4-6}$$

and, by induction,† given sets $E_1, \ldots, E_n$:

$$\left( \bigcup_{i=1}^{n} E_i \right)^c = \bigcap_{i=1}^{n} E_i^c \tag{1.4-7}$$
$$\left( \bigcap_{i=1}^{n} E_i \right)^c = \bigcup_{i=1}^{n} E_i^c. \tag{1.4-8}$$

†See Section A.4 in Appendix A for the meaning of mathematical induction.


The relations are known as De Morgan’s laws after the English mathematician Augustus 
De Morgan (1806-1871). 
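
De Morgan's laws for $n$ sets can be spot-checked numerically. The Python sketch below is an illustration under stated assumptions: the universal set, the random subsets, the seed, and the function name `de_morgan_holds` are all choices made here. A passing check on examples is evidence, not a proof.

```python
import random


def de_morgan_holds(sets, omega):
    """Check both De Morgan laws for a list of subsets of the finite set `omega`."""
    omega = frozenset(omega)
    sets = [frozenset(s) for s in sets]
    complements = [omega - s for s in sets]

    union = frozenset().union(*sets)
    intersection = omega.intersection(*sets)

    # Complement of the union equals the intersection of the complements.
    law_for_union = (omega - union) == omega.intersection(*complements)
    # Complement of the intersection equals the union of the complements.
    law_for_intersection = (omega - intersection) == frozenset().union(*complements)
    return law_for_union and law_for_intersection


rng = random.Random(0)  # assumed seed, for repeatability only
omega = set(range(20))
random_sets = [{x for x in omega if rng.random() < 0.5} for _ in range(5)]

print(de_morgan_holds(random_sets, omega))             # True
print(de_morgan_holds([{1, 2}, {2, 3}], {1, 2, 3, 4})) # True
```
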

While Venn diagrams allow us to visualize these results, they do not really constitute a
proof. Toward this end, consider the mathematical definition of equality of two sets. Two
sets E and F are said to be equal if every element in E is in F and vice versa. Equivalently,


$$E = F \quad \text{if } E \subset F \text{ and } F \subset E. \tag{1.4-9}$$


Example 1.4-7 
(proving equality of sets) If we want to strictly prove one of the above set equalities, say
Eq. 1.4-4, $F = FE \cup FE^c$, we must proceed as follows. First show $F \subset FE \cup FE^c$ and then
show $F \supset FE \cup FE^c$. To show $F \subset FE \cup FE^c$, we consider an arbitrary element $\zeta \in F$.
Since $\zeta$ is either in E or in $E^c$, it must be in either $FE$ or $FE^c$ for any set E, and thus $\zeta$
must be in $FE \cup FE^c$. This establishes that $F \subset FE \cup FE^c$. Going the other way, to show
$F \supset FE \cup FE^c$, we start with an arbitrary element $\zeta \in FE \cup FE^c$. It must be that $\zeta$
belongs to either $FE$ or $FE^c$, and so $\zeta$ must belong to F, thus establishing $F \supset FE \cup FE^c$.
Since we have shown the two set inclusions, we may write $F = FE \cup FE^c$, meaning that both sets are equal.


Using this method you can establish the following helpful laws of set theory:
1. associative law for union
$$A \cup (B \cup C) = (A \cup B) \cup C$$
2. associative law for intersection
$$A(BC) = (AB)C$$
3. distributive law for union
$$A \cup (BC) = (A \cup B)(A \cup C)$$
4. distributive law for intersection
$$A(B \cup C) = (AB) \cup (AC)$$
We will use these identities or laws for analyzing set equations below. However, these 
four laws must be proven first. Here, as an example, we give the proof of 1. 


Example 1.4-8 
(proof of associative law for union) We wish to prove $A \cup (B \cup C) = (A \cup B) \cup C$. To do this we must show that both $A \cup (B \cup C) \subset (A \cup B) \cup C$ and $A \cup (B \cup C) \supset (A \cup B) \cup C$. Starting with the former, assume that $\zeta \in A \cup (B \cup C)$; then it follows that $\zeta$ is in $A$ or in $B \cup C$. But then $\zeta$ is in $A$ or $B$ or $C$, so it is in $A \cup B$ or in $C$, which is the same as saying $\zeta \in (A \cup B) \cup C$. To complete the proof, we must go the other way and show, starting with $\zeta \in (A \cup B) \cup C$, that $\zeta$ must also be an element of $A \cup (B \cup C)$. This part is left for the student.
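
As a sanity check (not a proof), De Morgan's laws and the four laws listed above can be tested exhaustively on a small finite universal set. The sketch below assumes a hypothetical three-element universe `OMEGA`; the helper names `all_subsets`, `complement`, and `laws_hold` are illustrative only. Passing this check on one finite example does not replace the element-wise argument used in Examples 1.4-7 and 1.4-8.

```python
from itertools import combinations

def all_subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

OMEGA = frozenset({1, 2, 3})   # hypothetical small universal set
SUBSETS = all_subsets(OMEGA)   # 2^3 = 8 subsets

def complement(s):
    """Complement of s relative to OMEGA."""
    return OMEGA - s

def laws_hold():
    """Check De Morgan's laws and the four set laws on every choice of subsets."""
    for a in SUBSETS:
        for b in SUBSETS:
            # De Morgan's laws, Eqs. (1.4-5) and (1.4-6)
            if complement(a | b) != complement(a) & complement(b):
                return False
            if complement(a & b) != complement(a) | complement(b):
                return False
            for c in SUBSETS:
                if a | (b | c) != (a | b) | c:          # associative, union
                    return False
                if a & (b & c) != (a & b) & c:          # associative, intersection
                    return False
                if a | (b & c) != (a | b) & (a | c):    # distributive, union
                    return False
                if a & (b | c) != (a & b) | (a & c):    # distributive, intersection
                    return False
    return True

print(laws_hold())
```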
Sigma fields. Consider a universal set $\Omega$ and a certain collection of subsets of $\Omega$. Let $E$ and $F$ be two arbitrary subsets in this collection. This collection of subsets forms a field $\mathcal{M}$ if


(1) $\phi \in \mathcal{M}$, $\Omega \in \mathcal{M}$.
(2) If $E \in \mathcal{M}$ and $F \in \mathcal{M}$, then $E \cup F \in \mathcal{M}$ and $EF \in \mathcal{M}$.†
(3) If $E \in \mathcal{M}$, then $E^c \in \mathcal{M}$.


We will need to consider fields of sets (fields of events in probability) in order to avoid 
some problems. If our collection of events were not a field, then we could define a probability 
for some events, but not for their complement; that is, we could not define the probability 
that these events do not occur! Similarly we need to be able to consider the probability of 
the union of any two events, that is, the probability that either ‘or both’ of the two events 
occur. Thus, for probability theory, we need to have a probability assigned to all the events 
in the field. 

Many times we will have to consider an infinite set of outcomes and events. In that case we need to extend our definition of field. A sigma ($\sigma$) field‡ $\mathcal{F}$ is a field that is closed under any countable number of unions, intersections, and complementations. Thus, if $E_1, \ldots, E_n, \ldots$ belong to $\mathcal{F}$, so do

\[
\bigcup_{i=1}^{\infty} E_i \quad \text{and} \quad \bigcap_{i=1}^{\infty} E_i,
\]

where these are simply defined as

\[
\bigcup_{i=1}^{\infty} E_i \triangleq \{\text{the set of all elements in at least one } E_i\}
\]

and

\[
\bigcap_{i=1}^{\infty} E_i \triangleq \{\text{the set of all elements in every } E_i\}.
\]

Note that these two infinite operations of union and intersection would be meaningless 
without a specific definition. Unlike infinite summations which are defined by limiting oper- 
ations with numbers, there are no such limiting operations defined on sets, hence the need 
for a definition. 


Events. Consider a probability experiment with sample space $\Omega$. If $\Omega$ has a countable number of elements, then every subset of $\Omega$ may be assigned a probability in a way consistent with the axioms given in the next section. Then the class of all subsets will make up a field or $\sigma$-field simply because every subset is included. This collection of all the subsets of $\Omega$ is called the largest $\sigma$-field.


†From this it follows by mathematical induction that if $E_1, \ldots, E_n$ belong to $\mathcal{M}$, then so do $\bigcup_{i=1}^{n} E_i$ and $\bigcap_{i=1}^{n} E_i$.
‡Also sometimes called a $\sigma$-algebra.

Sometimes, though, we do not have enough probability information to assign a probability to every subset. In that case we need to define a smaller collection of events that is still a $\sigma$-field. We will discuss this matter in the example below. Going to the limit, if we have no probability information, then we must content ourselves with the smallest $\sigma$-field of events, consisting of just the null event $\phi$ and the certain event $\Omega$. While this collection of two events is a $\sigma$-field, it is not very useful.


Example 1.4-9 
(field generated by two events) Assume we have interest in only two events $A$ and $B$ in an arbitrary sample space $\Omega$ and we desire to find the smallest field containing these two events. We can proceed as follows. First we generate a disjoint decomposition of the sample space:

\[
\begin{aligned}
\Omega &= \Omega (A \cup A^c)(B \cup B^c) \\
&= AB \cup AB^c \cup A^cB \cup A^cB^c.
\end{aligned}
\]

Next generate a collection of events from these four basic disjoint (non-overlapping) events as follows: The first four events are $AB$, $AB^c$, $A^cB$, and $A^cB^c$. Then we add the pairwise unions of these disjoint events, such as $AB \cup AB^c$, $AB \cup A^cB$, and $AB \cup A^cB^c$. Finally we add the unions of triples of these four disjoint events, together with the null event $\phi$ (none of the four included) and $\Omega$ itself (all four included). The total number of events will then be $2 \times 2 \times 2 \times 2 = 2^4 = 16$, since each of the four basic disjoint events can be included, or not, in the event.

This collection of events is guaranteed to be a field, since we construct each of its 16 events from the four basic disjoint events, thus ensuring that complements are in the collection via $\Omega = AB \cup AB^c \cup A^cB \cup A^cB^c$. Unions are trivially in the collection too. Because all the events in the collection are built up from the four disjoint events, complements are just the events that have been left out, e.g., $(AB \cup AB^c)^c = A^cB \cup A^cB^c$, which is recognized as being in the collection. Hence we have a field. In fact this is the smallest field that contains the events $A$ and $B$. We call this the field generated by events $A$ and $B$. Can you show that event $A$ is in this field?
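
The construction in this example can be sketched in a few lines of code. The sample space and the events `A` and `B` below are hypothetical, chosen so that all four basic disjoint events are nonempty (otherwise fewer than 16 distinct events result). The function names `generated_field` and `is_field` are illustrative.

```python
from itertools import combinations

# Hypothetical sample space and events; all four atoms are nonempty here.
OMEGA = frozenset(range(1, 9))
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})

def generated_field(omega, a, b):
    """Return every union of the atoms AB, AB^c, A^cB, A^cB^c (2^4 choices)."""
    atoms = [a & b, a - b, b - a, omega - (a | b)]
    events = set()
    for r in range(len(atoms) + 1):
        for chosen in combinations(atoms, r):
            events.add(frozenset().union(*chosen))   # empty choice gives the null event
    return events

def is_field(events, omega):
    """Check the three field conditions on a finite collection of sets."""
    if frozenset() not in events or omega not in events:
        return False
    for e in events:
        if omega - e not in events:                  # closed under complement
            return False
        for f in events:
            if e | f not in events or e & f not in events:
                return False                         # closed under union/intersection
    return True

FIELD = generated_field(OMEGA, A, B)
print(len(FIELD), is_field(FIELD, OMEGA), A in FIELD)   # 16 True True
```

The last value printed answers the closing question numerically for this instance: $A = AB \cup AB^c$ is one of the pairwise unions.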


When $\Omega$ is not countable, for example, when $\Omega = R^1 =$ the real line, advanced mathematics (measure theory) has found that not every subset of $\Omega$ can be assigned a probability (is measurable) in a way that will be consistent. So we must content ourselves with smaller collections of subsets of the universal event $\Omega$ that form a $\sigma$-field. On the real line $R^1$, for example, we can generate a $\sigma$-field from all the intervals, open/closed, and this is called the Borel field of events on the real line. As a practical matter, it has been found that the Borel field on the real line includes all subsets of engineering and scientific interest.†

At this stage of our development, we have two of the three objects required for the axiomatic theory of probability, namely, a sample space $\Omega$ of outcomes $\zeta$, and a $\sigma$-field $\mathcal{F}$ of events defined on $\Omega$. We still need a probability measure $P$. The three objects $(\Omega, \mathcal{F}, P)$ form a triple called the probability space $\mathscr{P}$ that will constitute our mathematical model. However, the probability measure $P$ must satisfy the following three axioms due to Kolmogorov.


†For two-dimensional Euclidean sample spaces, the Borel field of events would be subsets of $R^1 \times R^1 = R^2$; for three-dimensional sample spaces, it would be subsets of $R^1 \times R^1 \times R^1 = R^3$.

1.5 AXIOMATIC DEFINITION OF PROBABILITY


Probability is a set function $P[\cdot]$ that assigns to every event $E \in \mathcal{F}$ a number $P[E]$ called the probability of event $E$ such that


(1) $P[E] \ge 0$. (1.5-1)
(2) $P[\Omega] = 1$. (1.5-2)
(3) $P[E \cup F] = P[E] + P[F]$ if $EF = \phi$. (1.5-3)


The probability measure is not like an ordinary function. It does not take numbers for its argument, but rather it takes sets; that is, it is a measure of sets, our mathematical model for events. Since this is a special function, to distinguish it we will always use square brackets for its argument, a set of outcomes $\zeta$ in the sample space $\Omega$.

These three axioms are sufficient† to establish the following basic results, all but one of which we leave as exercises for the reader. Let $E$ and $F$ be events contained in $\mathcal{F}$; then


(4) $P[\phi] = 0$. (1.5-4)
(5) $P[EF^c] = P[E] - P[EF]$. (1.5-5)
(6) $P[E] = 1 - P[E^c]$. (1.5-6)
(7) $P[E \cup F] = P[E] + P[F] - P[EF]$. (1.5-7)

From Axiom 3 we can establish by mathematical induction that

\[
P\left[ \bigcup_{i=1}^{n} E_i \right] = \sum_{i=1}^{n} P[E_i] \quad \text{if } E_i E_j = \phi \text{ for all } i \neq j. \tag{1.5-8}
\]

From this result and Equation 1.5-7, we can establish by induction the general result that $P\left[\bigcup_{i=1}^{n} E_i\right] \le \sum_{i=1}^{n} P[E_i]$. This result is sometimes known as the union bound, often used in digital communications theory to provide an upper bound on the probability of error.


Example 1.5-1 
(probability of the union of two events) We wish to prove result (7). First we decompose the event $E \cup F$ into three disjoint events as follows:

\[
E \cup F = EF^c \cup E^cF \cup EF.
\]

By Axiom 3

\[
\begin{aligned}
P[E \cup F] &= P[EF^c \cup E^cF] + P[EF] \\
&= P[EF^c] + P[E^cF] + P[EF], \quad \text{by Axiom 3 again} \\
&= P[E] - P[EF] + P[F] - P[EF] + P[EF] \\
&= P[E] + P[F] - P[EF].
\end{aligned} \tag{1.5-9}
\]


†A fourth axiom, $P\left[\bigcup_{i=1}^{\infty} E_i\right] = \sum_{i=1}^{\infty} P[E_i]$ if $E_iE_j = \phi$ for all $i \neq j$, must be included to enable one to deal rigorously with limits and countable unions. This axiom is of no concern to us here but will be in later chapters.

We can apply this result to the following problem. In a certain bread store, there are two events of interest, $W \triangleq \{\text{white bread is available}\}$ and $R \triangleq \{\text{rye bread is available}\}$. Based on past experience with this establishment, we take $P[W] = 0.8$ and $P[R] = 0.7$. We also know that the probability that both breads are present is 0.6, that is, $P[WR] = 0.6$. We now ask what is the probability of either bread being present, that is, what is $P[W \cup R]$? The answer is obtained from basic result (7) as

\[
\begin{aligned}
P[W \cup R] &= P[W] + P[R] - P[WR] \\
&= 0.8 + 0.7 - 0.6 \\
&= 0.9.
\end{aligned}
\]


We pause for a bit of terminology. We say an event $E$ occurs whenever the outcome of our experiment is one of the elements in $E$. So “$P[E]$” is read as “the probability that event $E$ occurs.”


A measure of events not outcomes. The reader will have noticed that we talk of 
the probability of events rather than the probability of outcomes. For finite and countable 
sample spaces, we could just as well talk about probabilities of outcomes; however, we do 
not do so for several reasons. One is we would still need to talk about probabilities of events 
and so would need two types of probability measures, one for outcomes and one for events. 
Second, in some cases we only know the probability of some events and don’t know enough probabilistic detail to assign a probability to each outcome. Lastly, and most importantly, in the case of continuous sample spaces with uncountably many outcomes, for example the points on the real number interval [0,5], each individual outcome may well have zero probability, and hence any theory based on the probability of the outcomes would be useless. For these and other reasons we base our approach on events, and so probability measures events, not outcomes.


Example 1.5-2 
(toss coin once) The experiment consists of throwing a coin once. Our idealized outcomes are then H and T, with sample space

\[
\Omega = \{H, T\}.
\]

The $\sigma$-field of events consists of the following sets: $\{H\}$, $\{T\}$, $\Omega$, $\phi$. With the coin assumed fair, we have†

\[
P[\{H\}] = P[\{T\}] = \tfrac{1}{2}, \quad P[\Omega] = 1, \quad P[\phi] = 0.
\]


†Remember the outcome $\zeta$ is the output or result of our experiment. The set of all outcomes is the sample space $\Omega$.

Example 1.5-3
(toss die once) The experiment consists of throwing a die once. The outcomes are the number of dots $\zeta = 1, \ldots, 6$ appearing on the upward facing side of the die. The sample space $\Omega$ is given by $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event field consists of $2^6$ events, each one containing, or not, each of the outcomes $i$. Some events are

\[
\phi, \ \Omega, \ \{1\}, \ \{1,2\}, \ \{1,2,3\}, \ \{1,4,6\}, \ \text{and } \{1,2,4,5\}.
\]

We assign probabilities to the elementary or singleton events $\{i\}$:

\[
P[\{i\}] = \tfrac{1}{6}, \quad i = 1, \ldots, 6.
\]

All probabilities can now be computed from the basic axioms and the assumed probabilities for the elementary events. For example, with $A = \{1\}$ and $B = \{2,3\}$ we obtain $P[A] = \tfrac{1}{6}$. Also $P[A \cup B] = P[A] + P[B]$, since $AB = \phi$. Furthermore, $P[B] = P[\{2\}] + P[\{3\}] = \tfrac{2}{6}$, so that

\[
P[A \cup B] = \tfrac{1}{6} + \tfrac{2}{6} = \tfrac{1}{2}.
\]

Example 1.5-4 
(choose ball from urn) The experiment consists of picking at random a numbered ball from 12 balls numbered 1 to 12 in an urn. Our idealized outcomes are then the numbers $\zeta = 1$ to 12, with sample space

\[
\Omega = \{1, \ldots, 12\}.
\]

Let the following events be specified:†

\[
\begin{aligned}
A &= \{1,\ldots,6\}, & B &= \{3,\ldots,9\}, \\
A \cup B &= \{1,\ldots,9\}, & AB &= \{3,4,5,6\}, & AB^c &= \{1,2\}, \\
B^c &= \{1,2,10,11,12\}, & A^c &= \{7,\ldots,12\}, & A^cB^c &= \{10,11,12\}, \\
(AB)^c &= \{1,2,7,8,9,10,11,12\}.
\end{aligned}
\]

Hence

\[
\begin{aligned}
P[A] &= P[\{1\}] + P[\{2\}] + \cdots + P[\{6\}], \\
P[B] &= P[\{3\}] + \cdots + P[\{9\}], \\
P[AB] &= P[\{3\}] + \cdots + P[\{6\}].
\end{aligned}
\]

If $P[\{1\}] = \cdots = P[\{12\}] = \tfrac{1}{12}$, then $P[A] = \tfrac{1}{2}$, $P[B] = \tfrac{7}{12}$, $P[AB] = \tfrac{4}{12}$, and so forth.
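
The sets and probabilities in this example can be reproduced with a short sketch. It assumes the equally likely assignment $P[\{i\}] = 1/12$ stated above; the helper name `prob` is illustrative. Exact fractions are used so that no rounding obscures the result.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 13))      # balls numbered 1..12
A = frozenset(range(1, 7))           # {1,...,6}
B = frozenset(range(3, 10))          # {3,...,9}

def prob(event):
    """P[event] under the equally likely assignment P[{i}] = 1/12."""
    return Fraction(len(event), len(OMEGA))

A_c, B_c = OMEGA - A, OMEGA - B      # complements relative to OMEGA

print(sorted(A & B), sorted(A - B), sorted(A_c & B_c))
print(prob(A), prob(B), prob(A & B))                       # 1/2 7/12 1/3
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))      # True, result (7)
```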


We point out that a theory of probability could be developed from a slightly different set of 
axioms [1-7]. However, whatever axioms are used and whatever theory is developed, for it 
to be useful in solving problems in the physical world, it must model our empirical concept 
of probability as a relative frequency and the consequences that follow from it. 


†We say event $A$ occurs whenever any of the numbers 1 through 6 appears on a ball removed from the urn.

Figure 1.5-1 Partitioning $\bigcup_{i=1}^{3} E_i$ into seven disjoint regions $A_1, \ldots, A_7$.


Probability of union of events. The extension of Equation 1.5-7 to the case of three events is straightforward but somewhat tedious. We consider three events, $E_1$, $E_2$, $E_3$, and wish to compute the probability, $P\left[\bigcup_{i=1}^{3} E_i\right]$, that at least one of these events occurs. From the Venn diagram in Figure 1.5-1, we see seven disjoint regions in $\bigcup_{i=1}^{3} E_i$, which we label as $A_i$, $i = 1, \ldots, 7$. You can prove this using the same method used in Example 1.4-9. Then $P\left[\bigcup_{i=1}^{3} E_i\right] = P\left[\bigcup_{i=1}^{7} A_i\right] = \sum_{i=1}^{7} P[A_i]$, from Axiom 3.

In terms of the original events, the seven disjoint regions can be identified as 


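\[
\begin{aligned}
A_1 &= E_1 E_2^c E_3^c = E_1 (E_2 \cup E_3)^c, \\
A_2 &= E_1 E_2 E_3^c, \\
A_3 &= E_1^c E_2 E_3^c = E_2 (E_1 \cup E_3)^c, \\
A_4 &= E_1 E_2 E_3, \\
A_5 &= E_1 E_2^c E_3, \\
A_6 &= E_1^c E_2 E_3, \\
A_7 &= E_1^c E_2^c E_3 = E_3 (E_1 \cup E_2)^c.
\end{aligned}
\]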


The computations of the probabilities $P[A_i]$, $i = 1, \ldots, 7$, follow from Equations 1.5-5 and 1.5-7. Thus, we compute

\[
\begin{aligned}
P[A_1] &= P[E_1] - P[E_1E_2 \cup E_1E_3] \\
&= P[E_1] - \{P[E_1E_2] + P[E_1E_3] - P[E_1E_2E_3]\}.
\end{aligned}
\]

In obtaining the first line, we used Equation 1.5-5. In obtaining the second line, we used Equation 1.5-7. The computations of $P[A_i]$, $i = 3, 7$, are quite similar to the computation of $P[A_1]$ and involve the same sequence of steps. Thus,

\[
\begin{aligned}
P[A_3] &= P[E_2] - \{P[E_1E_2] + P[E_2E_3] - P[E_1E_2E_3]\}, \\
P[A_7] &= P[E_3] - \{P[E_1E_3] + P[E_2E_3] - P[E_1E_2E_3]\}.
\end{aligned}
\]

The computations of $P[A_2]$, $P[A_5]$, and $P[A_6]$ are also quite similar and involve applying Equation 1.5-5. Thus,

\[
\begin{aligned}
P[A_2] &= P[E_1E_2] - P[E_1E_2E_3], \\
P[A_5] &= P[E_1E_3] - P[E_1E_2E_3], \\
P[A_6] &= P[E_2E_3] - P[E_1E_2E_3],
\end{aligned}
\]

and finally,

\[
P[A_4] = P[E_1E_2E_3].
\]
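Now, recalling that $P\left[\bigcup_{i=1}^{3} E_i\right] = \sum_{i=1}^{7} P[A_i]$, we merely add all the $P[A_i]$ to obtain the desired result. This gives

\[
P\left[ \bigcup_{i=1}^{3} E_i \right] = \sum_{i=1}^{3} P[E_i] - \left( P[E_1E_2] + P[E_1E_3] + P[E_2E_3] \right) + P[E_1E_2E_3]. \tag{1.5-10}
\]
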


Note that this result makes sense because in adding the measures of the three events we have counted the double overlaps twice each. But if we subtract these three overlaps, we have not counted $E_1E_2E_3$ at all, and so must add it back in. If we adopt the notation $P_i \triangleq P[E_i]$, $P_{ij} \triangleq P[E_iE_j]$, and $P_{ijk} \triangleq P[E_iE_jE_k]$, where $1 \le i < j < k \le 3$, we can rewrite Equation 1.5-10 as

\[
P\left[ \bigcup_{i=1}^{3} E_i \right] = \sum_{i=1}^{3} P_i - \sum_{1 \le i < j \le 3} P_{ij} + \sum_{1 \le i < j < k \le 3} P_{ijk}.
\]

The last sum contains only one term, namely $P_{123}$. Denote now each sum by the symbol $S_l$, where the $l$ denotes the number of subscripts associated with the terms in that sum. Then

\[
P\left[ \bigcup_{i=1}^{3} E_i \right] = S_1 - S_2 + S_3, \quad \text{where } S_1 \triangleq \sum_{i=1}^{3} P_i, \quad S_2 \triangleq \sum_{1 \le i < j \le 3} P_{ij}, \quad \text{and } S_3 \triangleq \sum_{1 \le i < j < k \le 3} P_{ijk}.
\]

Why this introduction of new notation? Using the symbols $S_l$, $l = 1, 2, \ldots$, we can extend Equation 1.5-10 to the general case.


Theorem 1.5-1 (probability of union of n events) The probability $P$ that at least one among the events $E_1, E_2, \ldots, E_n$ occurs in a given experiment is given by

\[
P = S_1 - S_2 + \cdots \pm S_n,
\]

where $S_1 \triangleq \sum_{i=1}^{n} P_i$, $S_2 \triangleq \sum_{1 \le i < j \le n} P_{ij}$, $\ldots$, $S_n \triangleq \sum_{1 \le i < j < k < \cdots < l \le n} P_{ijk \ldots l}$, and the term $S_l$ carries the sign $(-1)^{l+1}$. The last sum has $n$ subscripts and contains only one term.


The proof of this theorem is given in [1-8, p. 89]. It can also be proved by induction; that is, assume that $P = S_1 - S_2 + \cdots \pm S_n$ is true. Then show that for the case $n + 1$, $P = S_1 - S_2 + \cdots \mp S_{n+1}$. We leave this exercise for the braver reader.
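
The alternating sum of the theorem can be checked numerically on a finite, equally likely sample space. The sketch below is only an illustration under that assumption: the sample space and the three events are hypothetical, and `union_probability` is an illustrative name. It compares the alternating sum with a direct count of the union and also checks the union bound.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, omega):
    """P[event] when all outcomes in omega are equally likely (assumed)."""
    return Fraction(len(event), len(omega))

def union_probability(events, omega):
    """Evaluate S_1 - S_2 + ... with sign (-1)^(l+1) on S_l."""
    total = Fraction(0)
    for l in range(1, len(events) + 1):
        # S_l sums P[E_i E_j ...] over all choices of l distinct events.
        s_l = sum((prob(frozenset.intersection(*combo), omega)
                   for combo in combinations(events, l)), Fraction(0))
        total += (-1) ** (l + 1) * s_l
    return total

OMEGA = frozenset(range(1, 13))                      # hypothetical sample space
EVENTS = [frozenset({1, 2, 3, 4, 5, 6}),             # hypothetical E_1
          frozenset({4, 5, 6, 7, 8}),                # hypothetical E_2
          frozenset({1, 6, 8, 9})]                   # hypothetical E_3

direct = prob(frozenset.union(*EVENTS), OMEGA)       # count the union directly
by_theorem = union_probability(EVENTS, OMEGA)
bound = sum((prob(e, OMEGA) for e in EVENTS), Fraction(0))

print(direct, by_theorem, direct <= bound)           # 3/4 3/4 True
```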


1.6 JOINT, CONDITIONAL, AND TOTAL PROBABILITIES; INDEPENDENCE 


Assume that we perform the following experiment: We are in a certain U.S. city and wish 
to collect weather data about it. In particular we are interested in three events, call them 
A, B, and C, where 


A is the event that on any particular day, the temperature equals or exceeds 10°C; 

B is the event that on any particular day, the amount of precipitation equals or exceeds 
5 millimeters; 

$C$ is the event that on any particular day $A$ and $B$ both occur, that is, $C \triangleq AB$.


Since $C$ is an event, we can compute $P[C] = P[AB]$, and we call $P[AB]$ the joint probability of the events $A$ and $B$. This notion can obviously be extended to more than two events; that is, $P[EFG]$ is the joint probability of events $E$, $F$, and $G$.† Now let $n_E$ denote the number of days on which event $E$ occurred. Over a thousand-day period ($n = 1000$), the following observations are made: $n_A = 811$, $n_B = 306$, $n_{AB} = 283$. By the relative frequency interpretation of probability

\[
\begin{aligned}
P[A] &\simeq \frac{n_A}{n} = \frac{811}{1000} = 0.811, \\
P[B] &\simeq \frac{n_B}{n} = 0.306, \\
P[AB] &\simeq \frac{n_{AB}}{n} = 0.283.
\end{aligned}
\]


Consider now the ratio $n_{AB}/n_A$. This is the relative frequency with which event $AB$ occurs given that event $A$ occurs. Put into words, it is the fraction of the days with temperature at or above 10°C on which the amount of precipitation also equals or exceeds 5 millimeters. Thus, we are dealing with the frequency of an event given that, or conditioned upon the fact that, another event has occurred. Note that

\[
\frac{n_{AB}}{n_A} = \frac{n_{AB}/n}{n_A/n} \simeq \frac{P[AB]}{P[A]}. \tag{1.6-1}
\]


This empirical concept suggests that we introduce in our theory a conditional probability 
measure. 
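
The arithmetic behind Equation 1.6-1 can be made concrete with the counts quoted above. The sketch below only restates those numbers; the variable names are illustrative. It confirms that dividing the two relative frequencies gives exactly $n_{AB}/n_A$.

```python
from fractions import Fraction

n, n_a, n_b, n_ab = 1000, 811, 306, 283   # counts quoted in the text

p_a = Fraction(n_a, n)                    # relative frequency of A
p_ab = Fraction(n_ab, n)                  # relative frequency of AB
rel_freq_b_given_a = Fraction(n_ab, n_a)  # fraction of A-days on which B also occurred

print(float(p_a), float(p_ab))                    # 0.811 0.283
print(p_ab / p_a == rel_freq_b_given_a)           # True
print(round(float(rel_freq_b_given_a), 3))        # 0.349
```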


†$E$, $F$, $G$ are any three events defined on the same probability space.




Conditional probability. The conditional probability $P[B|A]$ is defined by

\[
P[B|A] \triangleq \frac{P[AB]}{P[A]} \quad \text{if } P[A] > 0, \tag{1.6-2}
\]

and is read as “the probability that event $B$ occurs given that event $A$ has occurred.” Similarly we have

\[
P[A|B] \triangleq \frac{P[AB]}{P[B]} \quad \text{if } P[B] > 0. \tag{1.6-3}
\]

Definitions 1.6-2 and 1.6-3 can be used to compute the joint probability of $AB$ since

\[
\begin{aligned}
P[AB] &= P[A|B]P[B] \\
&= P[B|A]P[A].
\end{aligned}
\]


Independence. 


Definitions (independence of events) (i) Two events $A \in \mathcal{F}$, $B \in \mathcal{F}$ with $P[A] > 0$, $P[B] > 0$ are said to be independent if and only if (iff)

\[
P[AB] = P[A]P[B]. \tag{1.6-4}
\]

Since, in general, $P[AB] = P[B|A]P[A] = P[A|B]P[B]$, it follows that for independent events

\[
P[A|B] = P[A], \tag{1.6-5a}
\]
\[
P[B|A] = P[B]. \tag{1.6-5b}
\]

Thus, the definition satisfies our intuition: If $A$ and $B$ are independent, the outcome $B$ should have no effect on the conditional probability of $A$ and vice versa.

(ii) Three events $A$, $B$, $C$ defined on the probability space $\mathscr{P}$ and having nonzero probabilities are said to be jointly independent iff

\[
P[ABC] = P[A]P[B]P[C], \tag{1.6-6a}
\]
\[
P[AB] = P[A]P[B], \tag{1.6-6b}
\]
\[
P[AC] = P[A]P[C], \tag{1.6-6c}
\]
\[
P[BC] = P[B]P[C]. \tag{1.6-6d}
\]

This is an extension of (i) above and suggests the pattern for the definition of $n$ independent events $A_1, \ldots, A_n$. Note that it is ‘not sufficient’ to have just $P[ABC] = P[A]P[B]P[C]$. Pairwise independence must also be shown.
(iii) Let $A_i$, $i = 1, \ldots, n$, be $n$ events contained in $\mathcal{F}$. The $\{A_i\}$ are said to be jointly independent iff

\[
\begin{aligned}
P[A_iA_j] &= P[A_i]P[A_j] \\
P[A_iA_jA_k] &= P[A_i]P[A_j]P[A_k] \\
&\ \ \vdots \\
P[A_1 \cdots A_n] &= P[A_1]P[A_2] \cdots P[A_n]
\end{aligned}
\]

for all combinations of indices such that $1 \le i < j < k < \cdots \le n$.


Example 1.6-1 
(Sic bo) The game Sic bo is played in gambling casinos. Players bet on the outcome of a 
throw of three dice. Many bets are possible each with a different payoff. We list two of them 
below with the associated payoffs in parentheses: 


(1) Specified three of a kind (180 to 1), that is, pre-specified by the bettor; 
(2) Unspecified three of a kind (30 to 1), that is, any three-way match. 


What are the associated probabilities of winning from the bettor’s point of view, and what is his expected gain?


Solution 


(1) (specified three of a kind) Let $E_i$ be the event that the specified outcome appears on the $i$th toss. Then the event that three of a kind appear is $E_1E_2E_3$ with probability $P[E_1E_2E_3] = P[E_1]P[E_2]P[E_3] = 1/216$, where we have used the fact that the three events are independent since they refer to different tosses. A fair payout would thus be 216 to 1, not 180 to 1.

(2) (unspecified three of a kind) On the first throw any number can come up. On the next two throws, numbers that match the first throw must come up. Hence $P[\text{three unspecified}] = 1 \times 1/6 \times 1/6 = 1/36$. A fair payout is thus 36 to 1, not 30 to 1.
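
Both probabilities can be confirmed by enumerating the 216 equally likely outcomes of three fair dice. The sketch below also evaluates an expected gain, which the solution above does not spell out; that part rests on an assumed payoff convention (a “$k$ to 1” bet wins $k$ units on a one-unit stake and loses the stake otherwise), so the numbers it produces are only as good as that assumption. The specified triple (4, 4, 4) and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

THROWS = list(product(range(1, 7), repeat=3))   # 216 equally likely outcomes

def prob(predicate):
    """Probability of the event {t : predicate(t)} for three fair dice."""
    hits = sum(1 for t in THROWS if predicate(t))
    return Fraction(hits, len(THROWS))

p_specified = prob(lambda t: t == (4, 4, 4))            # bettor named fours (hypothetical)
p_unspecified = prob(lambda t: t[0] == t[1] == t[2])    # any three-way match

def expected_gain(p_win, payoff):
    """Expected gain per unit stake under the ASSUMED 'payoff to 1' convention."""
    return p_win * payoff - (1 - p_win)

print(p_specified, p_unspecified)                       # 1/216 1/36
print(expected_gain(p_specified, 180))                  # -35/216
print(expected_gain(p_unspecified, 30))                 # -5/36
```

Under this convention both bets have a negative expected gain for the bettor, which is consistent with the remark that the posted payoffs are below the fair ones.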


Example 1.6-2 
(testing three events for independence) An urn contains 10 numbered black balls (some even, some odd) and 20 numbered white balls (some even, some odd). Some of the balls of each color are lighter in weight than the others. The exact composition of the urn is shown in the tree diagram of Figure 1.6-1. The outcomes are triples $\zeta = (\text{color}, \text{weight}, \text{number})$. The sample space $\Omega$ is the collection of all these triples. Each draw is completely random.

Let $A$ denote the event of picking a black ball, $B$ denote the event of picking a light ball, and $C$ denote the event of picking an even-numbered ball. Are $A$, $B$, $C$ independent events?


Solution We first test whether $P[ABC] = P[A]P[B]P[C]$. Now $P[A] = 1/3$ since 1/3 of the balls are black, $P[B] = 1/2$ since from the tree diagram we see that 15/30 of the balls are light, and $P[C] = 2/5$ since 12/30 of the balls are even numbered. Now $P[ABC] = 2/30$ since the event $ABC$ is black, light, and even and there are only two such balls. Multiplying out we find that $P[ABC] = P[A]P[B]P[C]$. So the three events pass this part of the test for independence. However, for (full) independence, we must also have $P[AB] = P[A]P[B]$, $P[AC] = P[A]P[C]$, and $P[BC] = P[B]P[C]$. Note that $P[AC] = 2/30$ while $P[A]P[C] = 1/3 \times 12/30 = 2/15 \neq 2/30$. Hence $A$, $B$, and $C$ are not jointly independent.

30 balls
    10 black
        5 light: 2 even, 3 odd
        5 heavy: 5 odd, 0 even
    20 white
        10 light: 6 even, 4 odd
        10 heavy: 4 even, 6 odd

Figure 1.6-1 Diagram of composition of urn.
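
The four product conditions of Equations 1.6-6a to 1.6-6d can be tested mechanically from the urn composition of Figure 1.6-1. The sketch below encodes the eight leaf counts of the tree as a dictionary; the names `URN`, `prob`, and the three predicates are illustrative.

```python
from fractions import Fraction

# (color, weight, parity) -> number of balls, read off Figure 1.6-1
URN = {
    ("black", "light", "even"): 2, ("black", "light", "odd"): 3,
    ("black", "heavy", "even"): 0, ("black", "heavy", "odd"): 5,
    ("white", "light", "even"): 6, ("white", "light", "odd"): 4,
    ("white", "heavy", "even"): 4, ("white", "heavy", "odd"): 6,
}
TOTAL = sum(URN.values())   # 30 balls

def prob(predicate):
    """Probability that a randomly drawn ball satisfies predicate."""
    return Fraction(sum(n for key, n in URN.items() if predicate(key)), TOTAL)

def is_a(k): return k[0] == "black"   # event A
def is_b(k): return k[1] == "light"   # event B
def is_c(k): return k[2] == "even"    # event C

p_a, p_b, p_c = prob(is_a), prob(is_b), prob(is_c)
p_abc = prob(lambda k: is_a(k) and is_b(k) and is_c(k))
p_ab = prob(lambda k: is_a(k) and is_b(k))
p_ac = prob(lambda k: is_a(k) and is_c(k))
p_bc = prob(lambda k: is_b(k) and is_c(k))

print(p_abc == p_a * p_b * p_c)   # True: the triple product condition holds
print(p_ab == p_a * p_b)          # True
print(p_ac == p_a * p_c)          # False: 1/15 versus 2/15
print(p_bc == p_b * p_c)          # False: 4/15 versus 1/5
```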


Compound Experiments 


Often we need to consider compound experiments or repeated trials. If we have a probability 
space defined for the individual experiments, we would like to see what this implies for the 
complete or compound experiment. There are two cases to consider, to model the physical 
fact that often the repeated trials seem to be independent of one another, while in other 
important cases the outcome seems to depend on the prior outcomes of earlier trials. 


Independent experiments. Consider two independent experiments, meaning that the outcome of one is not affected by past, present, or future outcomes of the other. Let each have its own sample space $\Omega$, outcomes $\zeta$, events $E$, and probability measure $P$. Specifically, we have

\[
\zeta_1 \in E_1 \subset \Omega_1 \text{ with measure } P_1 \quad \text{and} \quad \zeta_2 \in E_2 \subset \Omega_2 \text{ with measure } P_2,
\]

as illustrated in Figure 1.6-2.
We want to be able to work with compound experiments, meaning that the sample space of the compound experiment is the Cartesian product of the two sample spaces,

\[
\Omega \triangleq \Omega_1 \times \Omega_2,
\]

with vector outcomes (elements) $\zeta = (\zeta_1, \zeta_2) \in E \subset \Omega$.


Example 1.6-3 
(flip two coins) Let two experiments each consist of flipping a two-sided coin, with the two sides denoted H and T. Then we have $\Omega_1 = \{H, T\} = \Omega_2$. In the compound experiment, we have $\Omega = \{(H,H), (H,T), (T,H), (T,T)\}$. We could also just as well write the outcomes $\zeta \in \Omega$ as strings of characters H and T rather than vectors. In that notation, we have $\Omega = \{HH, HT, TH, TT\}$. Considering event $E_1 = \{T\}$ in the first experiment, and event $E_2 = \{H\}$ in the second experiment, we have the event $E = \{TH\} = E_1 \times E_2 \subset \Omega$ in the compound experiment.

Figure 1.6-2 Two compound probabilistic experiments.

When we write a set with cross-product notation, we mean

\[
E_1 \times E_2 \triangleq \{\zeta = (\zeta_1, \zeta_2) \mid \zeta_1 \in E_1 \text{ and } \zeta_2 \in E_2\}.
\]


So the elements in the cross-product of two sets are all the possible ordered pairs of elements, 
one from each set. 


Example 1.6-4 
(toss two dice) Let the two experiments now each consist of tossing a die, with the six faces (up) being denoted as outcomes 1-6. Then we have $\Omega_1 = \{1,2,3,4,5,6\} = \Omega_2$. In the compound experiment, we have as outcomes the pair (or vector) elements of the cross-product sample space $\Omega = \{11, 12, \ldots, 16, 21, \ldots, 26, \ldots, 61, \ldots, 66\} = \Omega_1 \times \Omega_2$. Note that now not all events (subsets of $\Omega$) are of the form $E_1 \times E_2$. In fact this is a special case. Consider the event $\{11, 12, 31\}$, for example. It is missing the outcome 32 contained in $\{1,3\} \times \{1,2\}$. However, we can write this event as a disjoint union over set cross products

\[
\{11, 12, 31\} = \{\{1\} \times \{1,2\}\} \cup \{\{3\} \times \{1\}\}.
\]

Often we are interested in joint models for physical experiments that are independent of each other. This requires a definition. Thus, we define mathematically that two experiments are independent if the probabilities of events $E$ in the compound experiment can be expressed in terms of the individual probability measures $P_1$ and $P_2$.


Definition 1.6-1 Two experiments are said to be independent if (i) for a cross-product event $E = E_1 \times E_2$, we can write

\[
P[E] = P_1[E_1] P_2[E_2],
\]

and (ii) the probability of a general event $E$ in the compound experiment can be written, in terms of singleton events, as

\[
P[E] \triangleq \sum_{(\zeta_1, \zeta_2) \in E} P_1[\{\zeta_1\}] P_2[\{\zeta_2\}].
\]

We can generalize this concept to combining $n$ experiments to get the compound experiment’s sample space

\[
\begin{aligned}
\Omega &\triangleq \bigotimes_{i=1}^{n} \Omega_i \\
&= \Omega_1 \times \Omega_2 \times \Omega_3 \times \cdots \times \Omega_n,
\end{aligned}
\]

and vector (string) outcomes $\zeta = (\zeta_1, \ldots, \zeta_n) \in E \subset \Omega$, the compound experiment’s sample space.


Example 1.6-5 
(three experiments) Consider three independent experiments, each with its own sample space $\Omega_i$, $i = 1, 2, 3$. Let $E_i$ be any arbitrary event in $\Omega_i$. Then the general cross-product events $E = E_1 \times E_2 \times E_3$ in the compound experiment would have probabilities

\[
P[E_1 \times E_2 \times E_3] = P_1[E_1] P_2[E_2] P_3[E_3],
\]

where the events $E_i$ would be made up from unions and intersections of the measurable subsets of $\Omega_i$.


Example 1.6-6 
(repeated coin flips) Consider flipping a coin $n$ times. Each flip can be considered a random, independent experiment. Let the individual outcomes in each experiment be denoted H and T; then the outcomes in the compound experiment are strings of H and T of length $n$. There are $2^n$ distinguishable ordered strings. The probability of a string having $k$ H and $n - k$ T is given by

\[
P[\{(\zeta_1, \ldots, \zeta_n)\}] = \prod_{i=1}^{n} P_i[\{\zeta_i\}] = p^k q^{n-k},
\]

where $p$ and $q \triangleq 1 - p$, with $0 \le p \le 1$, are the individual probabilities of H and T, respectively, on a single coin flip.
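
A brief numerical illustration of this example: the sketch below enumerates all ordered strings for a small, hypothetical choice of $n$ and $p$, evaluates $p^k q^{n-k}$ for each, and checks that the $2^n$ string probabilities sum to one, as they must for a probability measure on the compound sample space. The function name `string_probability` is illustrative.

```python
from fractions import Fraction
from itertools import product

def string_probability(s, p):
    """Probability of one ordered H/T string under independent flips with P[{H}] = p."""
    k = s.count("H")                        # number of heads in the string
    return p ** k * (1 - p) ** (len(s) - k)

n = 4                                       # hypothetical number of flips
p = Fraction(3, 10)                         # hypothetical P[{H}] on a single flip
STRINGS = ["".join(t) for t in product("HT", repeat=n)]

print(len(STRINGS))                                      # 16
print(string_probability("HTHH", p))                     # 189/10000
print(sum(string_probability(s, p) for s in STRINGS))    # 1
```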


We can also express these compound probabilities in terms of general events rather than singleton events. Again consider two experiments with probability spaces

\[
\zeta_1 \in E_1 \subset \Omega_1 \text{ with measure } P_1 \quad \text{and} \quad \zeta_2 \in E_2 \subset \Omega_2 \text{ with measure } P_2;
\]

then the compound experiment consists of the probability space

\[
\zeta \triangleq (\zeta_1, \zeta_2) \in E \subset \Omega \text{ with measure } P,
\]

where the compound probability measure $P$ is defined for event $E \subset \Omega$ as follows. First, we must write the compound event $E$ as a disjoint union of cross-product events from the two experiments

\[
E = \bigcup_{i=1}^{k} E_{1,i} \times E_{2,i},
\]

for some positive integer $k$, where $E_{1,i}$ and $E_{2,i}$ are events in $\Omega_1$ and $\Omega_2$, respectively. In the simplest case $E$ will itself be a cross-product event, and we will have $k = 1$, but as we have seen in Example 1.6-4, it will generally be necessary to take the union of several cross-product events to express an arbitrary event $E$ in the compound sample space.


Definition 1.6-2 (alternative) When we say that the experiments are independent, we mean that for any event $E$ in the compound experiment,

\[
P[E] = \sum_{i=1}^{k} P_1[E_{1,i}] P_2[E_{2,i}], \quad \text{where } E = \bigcup_{i=1}^{k} E_{1,i} \times E_{2,i},
\]

a disjoint union, and where the $E_{1,i}$ and $E_{2,i}$ are events in $\Omega_1$ and $\Omega_2$, respectively. Here $k$ is the number of cross-product events necessary to express compound event $E$.

We note that additivity of probability is appropriate since the events are disjoint. We can see immediately that this alternative definition is consistent with the definition in terms of elementary or singleton events given above. To see this simply take $E_{1,i}$ and $E_{2,i}$ as singleton events. Clearly this more general approach can also be extended to $n > 2$ experiments straightforwardly. We next turn to the more complicated case of multiple dependent experiments.
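
The consistency of the two definitions can be illustrated on the event $\{11, 12, 31\}$ of Example 1.6-4. The sketch below assumes two fair dice (an assumption made only for this illustration) and computes the probability once as a sum over singleton outcomes and once from the disjoint cross-product decomposition; all function names are illustrative.

```python
from fractions import Fraction

# Assumed for illustration: both dice are fair.
P1 = {face: Fraction(1, 6) for face in range(1, 7)}
P2 = dict(P1)

def measure(p, event):
    """Probability of an event in one component experiment."""
    return sum((p[z] for z in event), Fraction(0))

def prob_by_singletons(event):
    """Sum P1[{z1}] P2[{z2}] over the outcomes (z1, z2) in the compound event."""
    return sum((P1[z1] * P2[z2] for z1, z2 in event), Fraction(0))

def prob_by_cross_products(pieces):
    """pieces: list of (E1_i, E2_i) whose cross products are disjoint."""
    return sum((measure(P1, e1) * measure(P2, e2) for e1, e2 in pieces),
               Fraction(0))

E = {(1, 1), (1, 2), (3, 1)}                  # the event {11, 12, 31}
PIECES = [({1}, {1, 2}), ({3}, {1})]          # {1} x {1,2} united with {3} x {1}

print(prob_by_singletons(E), prob_by_cross_products(PIECES))   # 1/12 1/12
```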


Dependent experiments.* Consider two “dependent experiments,” meaning that the second experiment’s probabilities will depend on the event that occurs in the first experiment. Let’s say the first experiment consists of outcomes $\zeta_{1,i}$, where $i = 1, \ldots, k$, whose probabilities $P_1[\{\zeta_{1,i}\}]$ are given. The probability measures for the second experiment must be parametrized with index $i$ from the first experiment, that is,

\[
P_{2,i}[E_2] \quad \text{for each event } E_2 \subset \Omega_2,
\]

where $\Omega_2$ is the sample space for the second experiment. This is illustrated in Figure 1.6-3. Then we write the probability measure for the compound experiment as follows.


Definition 1.6-3 (dependent experiments) Let $E_1 = \{\zeta_{1,i}\}$ be a singleton event in $\Omega_1$ for some $i$, and let $E_2$ be an event in $\Omega_2$; then consider the cross-product event $E = E_1 \times E_2$ in the compound experiment. We then write

\[
P[E] \triangleq P_1[\{\zeta_{1,i}\}] P_{2,i}[E_2],
\]

where the probability measure in the second experiment is a function of the outcome in the first experiment.


*Starred material can be omitted on a first reading.

Figure 1.6-3 Two dependent compound experiments.


First we note that this definition is consistent with the definition above for the case of independent experiments. This is because in the independent case all the $P_{2,i}$ are the same, that is, $P_{2,i} = P_2$ for all $i$.

More generally, let the event in the first experiment be $E_1 = \bigcup_i \{\zeta_{1,i}\}$, that is, a union of elementary (singleton) events indexed by $i$; then the probability of the compound event $E = E_1 \times E_2$ is written as

\[
P[E] = \sum_{i} P_1[\{\zeta_{1,i}\}] P_{2,i}[E_2].
\]

Here additivity makes sense since only one of the elementary events $\{\zeta_{1,i}\}$ can occur in the first experiment.


Example 1.6-7 
(flip biased coins) Let there be three biased coins considered. We flip the first one, with $p_1 = P_1[\{H\}]$. Depending on the outcome, H or T, we then flip coin 2 or coin 3, respectively. Assume for coin 2 that the probability $p_2 = P_2[\{H\}]$, and that for coin 3, we have $p_3 = P_3[\{H\}]$. Here, of course, we assume that all the $p_i$ satisfy $0 \le p_i \le 1$. Then for $p_2 \neq p_3$, we have the case of dependent experiments. Computing, for example, $P[\{HT\}]$, we get $p_1(1 - p_2)$ and for $P[\{TH\}]$, we get $(1 - p_1)p_3$, etc.
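
The four compound probabilities of this example can be tabulated directly. The numerical values of $p_1$, $p_2$, $p_3$ below are hypothetical, and `compound_probabilities` is an illustrative name; the sketch simply applies the product rule for dependent experiments to each two-flip string.

```python
from fractions import Fraction

def compound_probabilities(p1, p2, p3):
    """Probabilities of HH, HT, TH, TT when coin 2 follows H and coin 3 follows T."""
    return {
        "HH": p1 * p2,
        "HT": p1 * (1 - p2),
        "TH": (1 - p1) * p3,
        "TT": (1 - p1) * (1 - p3),
    }

# Hypothetical head probabilities for coins 1, 2, and 3.
p1, p2, p3 = Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)
PROBS = compound_probabilities(p1, p2, p3)

print(PROBS["HT"], PROBS["TH"])      # 1/8 1/8
print(sum(PROBS.values()))           # 1
# The second-flip measure depends on the first outcome whenever p2 != p3:
print(PROBS["HH"] / p1, PROBS["TH"] / (1 - p1))   # 3/4 1/4
```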


Example 1.6-8 
(conditioning on events) Consider that the weather today can be sunny, cloudy, or rainy with probabilities $p_{1,s}$, $p_{1,c}$, and $p_{1,r}$, respectively, where these three sum to one. Then tomorrow,
it may be also sunny, cloudy, or rainy, and that may depend on what happened today. So the 
conditional probability for the weather tomorrow can depend on these conditioning events, 
and would be expected to be a different measure for each one. We would have a set of three 
conditional probability measures for day 2, one for each condition from day 1. 
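Relation to conditional probability. Consider a compound experiment with two component experiments $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ that are independent, so that we have the compound experiment $(\Omega, \mathcal{F}, P)$ with

\[
P[E_1 \times E_2] = P_1[E_1] P_2[E_2]
\]

for all cross-product events $E_1 \times E_2 \in \mathcal{F}$, where $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$. We can think of the first experiment as occurring before the second one. Let the conditioning event $B \in \mathcal{F}$ be of the form $B = B_1 \times \Omega_2$, where $B_1 \in \mathcal{F}_1$. Then $P[B] = P_1[B_1] \cdot 1$. Similarly, let the event $A \in \mathcal{F}$ be of the form $A = \Omega_1 \times A_2$, where $A_2 \in \mathcal{F}_2$; then $P[A] = 1 \cdot P_2[A_2]$, and we find that the conditional probability

\[
\begin{aligned}
P[A|B] &= \frac{P[(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2)]}{P[B]} \\
&= \frac{P_1[B_1] P_2[A_2]}{P_1[B_1]} \\
&= P_2[A_2],
\end{aligned}
\]

where we have noted that

\[
\begin{aligned}
(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2) &= \{(\zeta_1, \zeta_2) \mid \zeta_2 \in A_2\} \cap \{(\zeta_1, \zeta_2) \mid \zeta_1 \in B_1\} \\
&= B_1 \times A_2.
\end{aligned}
\]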

Now this is what we expect to happen for two independent experiments. But, what 
happens when the two experiments are dependent? 


*Example 1.6-9 
(dependent case) Consider a compound experiment with two components as above, that is, $B = B_1 \times \Omega_2$ and $A = \Omega_1 \times A_2$, but now assume that these experiments are dependent. Assume the number of outcomes in the first experiment to be a finite number $k$ and write the probability measure of the second experiment as a function of the outcome on the first experiment, that is, $P_{2,i}$ for each outcome $\zeta_{1,i} \in \Omega_1$ for $i = 1, \ldots, k$. Assume also that $B_1 = \{\zeta_{1,i}\}$ for some value $i$. Then proceeding as in the last example, we have

\[
\begin{aligned}
P[A|B] &= \frac{P[(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2)]}{P_1[B_1]} \\
&= \frac{P_1[B_1] P_{2,i}[A_2]}{P_1[B_1]} \\
&= P_{2,i}[A_2], \quad \text{as expected.}
\end{aligned}
\]
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Example 1.6-10 
(communication channel and source) In a binary communication system, we have a binary 
source S along with a binary channel C’ (Figure 1.6-4) defined in terms of its conditional 
probabilities. The sample space 2 for this combined experiment is Q = {¢ = (x,y): 7 = 
and y = 0 or 1} = {(0,0), (0,1), (1,0), (1,1)}, where x denotes the source output that is 
the channel input, and y denotes the channel output. The joint probability function is then 
given as P[{(x,y)}] = Ps[{x}]Po[{y}|{x}] x,y = 0,1, where Pg is the probability measure 
of the source S and Pc is the conditional probability measure of the channel C. 

Because of noise a transmitted zero sometimes gets decoded as a received one and vice 
versa. From repeated use of the channel, it is known that 


$P_C[\{0\}|\{0\}] = 0.9, \quad P_C[\{1\}|\{0\}] = 0.1,$
$P_C[\{0\}|\{1\}] = 0.1, \quad P_C[\{1\}|\{1\}] = 0.9,$

and by design of the source $P_S[\{0\}] = P_S[\{1\}] = 0.5$.† The various probabilities of the
singleton events in the joint experiment are then

$P[\{(0,0)\}] = P_C[\{0\}|\{0\}]\,P_S[\{0\}] = 0.45$
$P[\{(0,1)\}] = P_C[\{1\}|\{0\}]\,P_S[\{0\}] = 0.05$
$P[\{(1,0)\}] = P_C[\{0\}|\{1\}]\,P_S[\{1\}] = 0.05$
$P[\{(1,1)\}] = P_C[\{1\}|\{1\}]\,P_S[\{1\}] = 0.45.$


We can also define some events on the compound or combined sample space:

$X_0 \triangleq$ “event that $x = 0$” and $X_1 \triangleq$ “event that $x = 1$,”
$Y_0 \triangleq$ “event that $y = 0$” and $Y_1 \triangleq$ “event that $y = 1$,”

and rewrite the above channel conditional probabilities as

$P[Y_0|X_0] = 0.9$ and $P[Y_1|X_0] = 0.1$,
$P[Y_0|X_1] = 0.1$ and $P[Y_1|X_1] = 0.9$.


Figure 1.6-4 A binary communication system (a binary source feeding a binary channel).


†It is good practice to design a code in which the zeros and ones appear at close to the same rate since
this puts the signaling capacity of the channel to greatest use.

The source probabilities are then just expressed as

$P[X_0] = 0.5$ and $P[X_1] = 0.5$.

In the combined experiment, the above joint probabilities become

$P[X_0 \cap Y_0] = P[Y_0|X_0]\,P[X_0] = 0.45$
$P[X_0 \cap Y_1] = P[Y_1|X_0]\,P[X_0] = 0.05$
$P[X_1 \cap Y_0] = P[Y_0|X_1]\,P[X_1] = 0.05$
$P[X_1 \cap Y_1] = P[Y_1|X_1]\,P[X_1] = 0.45.$
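As a quick numerical check of the four joint probabilities, the following Python sketch rebuilds them from the source and channel measures of Example 1.6-10. The dictionary names `P_S`, `P_C`, and `joint` are our own illustrative choices, not notation from the text.

```python
# Source and channel probabilities taken from Example 1.6-10.
P_S = {0: 0.5, 1: 0.5}                       # P_S[{x}]
P_C = {(0, 0): 0.9, (1, 0): 0.1,             # P_C[{y}|{x}], keyed as (y, x)
       (0, 1): 0.1, (1, 1): 0.9}

# Joint probability of each singleton outcome (x, y): P_C[y|x] * P_S[x].
joint = {(x, y): P_C[(y, x)] * P_S[x] for x in (0, 1) for y in (0, 1)}

for (x, y), p in sorted(joint.items()):
    print(f"P[X_{x} and Y_{y}] = {p:.2f}")
```

The four values sum to one, as they must for the singleton events of the combined sample space.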


The introduction of conditional probabilities raises the important question of whether conditional
probabilities satisfy Axioms 1 to 3. In other words, given any two events $E$, $F$ such
that $EF = \phi$ and a third arbitrary event $A$ with $P[A] > 0$, all belonging to the $\sigma$-field of
events $\mathcal{F}$ in the probability space $(\Omega, \mathcal{F}, P)$, does

$P[E|A] \geq 0$?

$P[\Omega|A] = 1$?

$P[E \cup F|A] = P[E|A] + P[F|A]$ for $EF = \phi$?


The answer is yes. We leave the details as an exercise to the reader. They follow directly 
from the definition of conditional probability and the three Kolmogorov axioms. 


Example 1.6-11 
(probability trees) Three events A, B, and C are often specified in terms of conditional 
probabilities as follows: 


$P[A]$, $P[B|A]$, $P[B|A^c]$ and
$P[C|BA]$, $P[C|BA^c]$, $P[C|B^cA]$, $P[C|B^cA^c]$.


In such a case the problem can be summarized in a tree diagram, such as Figure 1.6-5, where 
the branches are labeled with the relevant conditional probabilities and the node values are 
the corresponding joint probabilities. Here the root node can be thought of as having value 
1.0 and being associated with the certain event $\Omega$. If we want to evaluate the probability
of an event on a leaf (the last set of nodes) of the tree, we just multiply the conditional 
probabilities on its path. 

A way this can arise is if the events come from compound experiments conducted
sequentially, so that the event $B$ depends on the event $A$, and in turn the event $C$ depends on
them both. A more general tree would have more than two outgoing branches at each node
indicating more than two events were possible, for example, $A_1, A_2, \ldots, A_N$. The conditional
probabilities can be stored in a data structure in a machine, which could be queried for
answers to various joint probability questions, such as: What is the probability of the joint
event $C_kB_jA_i$, which could be answered by tracing the corresponding path in the stored
data structure and then multiplying the values on its branches? For a concrete example,
take first round $A_i$ events to indicate the health (good, fair, poor) of a plant purchased
at a local nursery, then $B_j$ can indicate its health one week later, and $C_k$ can indicate the
health at two weeks from purchase.

Figure 1.6-5 A probability tree diagram with conditional probabilities on the branches and joint
probabilities at the nodes.
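One possible way to store such a tree in a machine is sketched below in Python for the binary events $A$, $B$, $C$ of Example 1.6-11. The branch values are hypothetical numbers chosen only for illustration (the text gives none), and the names `BRANCH` and `joint_probability` are our own.

```python
# Hypothetical branch values; the text specifies no numbers.
# Key: path of events from the root. Value: conditional probability of the
# last event on the path given all earlier events ("Ac" means A complement).
BRANCH = {
    ("A",): 0.3, ("Ac",): 0.7,
    ("A", "B"): 0.6, ("A", "Bc"): 0.4,
    ("Ac", "B"): 0.2, ("Ac", "Bc"): 0.8,
    ("A", "B", "C"): 0.5, ("A", "B", "Cc"): 0.5,
    ("A", "Bc", "C"): 0.1, ("A", "Bc", "Cc"): 0.9,
    ("Ac", "B", "C"): 0.7, ("Ac", "B", "Cc"): 0.3,
    ("Ac", "Bc", "C"): 0.25, ("Ac", "Bc", "Cc"): 0.75,
}


def joint_probability(path):
    """Multiply the branch values from the root down to the node `path`."""
    p = 1.0  # the root carries the certain event, with value 1.0
    for depth in range(1, len(path) + 1):
        p *= BRANCH[tuple(path[:depth])]
    return p


leaves = [path for path in BRANCH if len(path) == 3]
print(joint_probability(("A", "B", "C")))                # P[CBA] = P[C|BA] P[B|A] P[A]
print(sum(joint_probability(leaf) for leaf in leaves))   # leaf values sum to one
```

A tree with more than two branches per node, such as the plant-health example, would use the same structure with three labels per level instead of two.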


The next example, illustrating the use of joint and conditional probabilities, has appli- 
cations in real life where we might be forced to make important decisions without knowing 
all the facts. 


Example 1.6-12 
(beauty contest)† Assume that a beauty contest is being judged by the following rules:
(1) There are N contestants not seen by the judges before the contest, and (2) the contestants 
are individually presented to the judges in a random sequence. Only one contestant appears 
before the judges at any one time. (3) The judges must decide on the spot whether the 
contestant appearing before them is the most beautiful. If they decide in the affirmative, 
the contest is over but the risk is that a still more beautiful contestant is in the group as yet 
not displayed. In that case the judges would have made the wrong decision. On the other 
hand, if they pass over the candidate, the contestant is disqualified from further consideration
even if it turns out that all subsequent contestants are less beautiful. What is a good
strategy to follow to increase the probability of picking the most beautiful contestant over
that of a random choice?

†Thanks are due to Geof Williamson and Jerry Tiemann for valuable discussions regarding this problem.

Figure 1.6-6 The numbers along the axis represent the chronology of the draw, not the number
actually drawn from the bag. (Axis: draw number $n = 1, 2, \ldots, x-1, x, \ldots, N$; the position
of the highest number is marked at draw $x$.)


Solution To make the problem somewhat more quantitative, assume that all the virtues 
of each contestant are summarized into a single “beauty” number. Thus, the most beautiful 
contestant is associated with the highest number, and the least beautiful has the lowest 
number. We make no assumptions regarding the distribution or chronology of appearance 
of the numbers. The numbers, unseen by the judges, are placed in a bag and the numbers are 
drawn individually from the bag. We model the problem then as one of randomly drawing 
the “beauty” numbers from a bag. We consider that the draws are ordered along a line as 
shown in Figure 1.6-6. Thus, the first draw is number 1, the second is 2, and so forth. At 
each draw, a number appears. Is it the largest of all the N numbers? 

Assume that the following “wait-and-see” strategy is adopted: We pass over the first k 
draws (i.e., we reject the first k contestants) but record the highest number (i.e., the most
beautiful contestant) observed within this group of k. Then we continue drawing numbers 
(i.e., call for more contestants to appear). The first draw (contestant) after the k passed-over 
draws that yields a number exceeding the largest number from the first k draws is taken to 
be the winner. If a larger number does not occur, then the judge declines to vote and we 
count this as an error. 

Let us define $E_j(k)$ as the event that the largest number that is drawn from the first $j$
draws occurs in the group of first $k$ draws. Then for $j \leq k$, $E_j(k) = \Omega$ (the certain event),
but for $j > k$, $E_j(k)$ will be a proper subset of $\Omega$. Let $x$ denote the draw that will contain
the largest number among the $N$ numbers in the bag. Then two events must occur jointly
for the correct decision to be realized: (1) (obvious) $\{x > k\}$; and (2) (subtle) $E_j(k)$ for all
$j$ such that $k < j < x$. Then for a correct decision $C$ to happen, the subevent $\{x = j + 1\}$
must occur jointly with the event $E_j(k)$ for each $j$ such that $k \leq j < N$. The event $\{x > k\}$
can be resolved into disjoint subevents as

$\{x > k\} = \{x = k+1\} \cup \{x = k+2\} \cup \ldots \cup \{x = N\}.$

Thus,

$C = \{x = k+1, E_k(k)\} \cup \{x = k+2, E_{k+1}(k)\} \cup \ldots \cup \{x = N, E_{N-1}(k)\},$

and the probability of a correct decision is

$P[C] = \sum_{j=k}^{N-1} P[x = j+1, E_j(k)] = \sum_{j=k}^{N-1} P[E_j(k)|x = j+1]\,P[x = j+1] = \sum_{j=k}^{N-1} \frac{k}{j} \cdot \frac{1}{N} = \frac{k}{N} \sum_{j=k}^{N-1} \frac{1}{j},$

where we have used the fact that $P[x = j+1] = \frac{1}{N}$ since all $N$ draws are equally likely to
result in the largest number. Also $P[E_j(k)|x = j+1] = \frac{k}{j}$ since the “largest” draw from the
first $j$ draws could equally likely be any of the first $j$ draws, and so the probability that it
is in the first $k$ of these $j$ draws is given by the fraction $\frac{k}{j}$.

By the Euler summation formula† for large $N$,

$P[C] = \frac{k}{N} \sum_{j=k}^{N-1} \frac{1}{j} \approx \frac{k}{N} \int_k^N \frac{d\xi}{\xi} = \frac{k}{N} \ln \frac{N}{k},$ for $k$ large enough.

Neglecting the integer constraint, an approximate best choice of $k$, say $k_0$, can be found by
differentiation. Setting

$\frac{dP[C]}{dk} = 0,$

we find that

$k_0 \approx \frac{N}{e}.$

Invoking the integer constraint we round $k_0$ to the nearest integer, so as to finally obtain

$k_0 \approx \left\lfloor \frac{N}{e} + \frac{1}{2} \right\rfloor,$

where $\lfloor \cdot \rfloor$ denotes the floor (greatest-integer) function. The maximum probability of a correct decision
$P[C]$ then becomes

$P[C] \approx \frac{k_0}{N} \ln \frac{N}{k_0} \approx \frac{1}{e} \ln e \approx 0.367.$

†See, for example, G. F. Carrier et al., Functions of a Complex Variable (New York: McGraw-Hill,
1966), p. 246, or visit the Wikipedia page: Euler-Maclaurin formula
(http://en.wikipedia.org/wiki/Euler%E2%80%93Maclaurin_formula).


Thus, we should let approximately the first third (more precisely 36.7 percent) of the contes- 
tants pass by before beginning to judge the contestants in earnest. We assume that N is 
reasonably large for this result to hold. The interesting fact is that the result is indepen- 
dent of (large) N while the probability of picking the most beautiful candidate by random 
selection decreases as 1/N. 
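The large-$N$ approximation can be compared with the exact finite sum $P[C] = (k/N)\sum_{j=k}^{N-1} 1/j$ implied by the argument above. The Python sketch below does this by brute force; the function names `p_correct` and `best_k` and the choice $N = 1000$ are our own assumptions for illustration.

```python
import math


def p_correct(k, n):
    """Exact P[C] = (k/N) * sum_{j=k}^{N-1} 1/j for the wait-and-see rule, k >= 1."""
    return (k / n) * sum(1.0 / j for j in range(k, n))


def best_k(n):
    """Number of passed-over draws k in 1..N-1 that maximizes the exact P[C]."""
    return max(range(1, n), key=lambda k: p_correct(k, n))


n = 1000                                   # illustrative population size
k_star = best_k(n)
k_approx = math.floor(n / math.e + 0.5)    # the rounded N/e rule from the text

print("exact best k      :", k_star)
print("rounded N/e       :", k_approx)
print("P[C] at best k    :", p_correct(k_star, n))
print("1/e for comparison:", 1.0 / math.e)
```

For moderately large $N$ the brute-force optimum should land within one draw of $N/e$, and the resulting $P[C]$ should be close to $1/e$.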


Here are some other situations that require a strategy that will maximize the probability 
of making the right decision. 


1. You are apartment-hunting and have selected 30 rent-controlled flats to inspect. You 
see an apartment that you like but you are not ready to make an offer because you 
think that the next apartment to be shown might be more desirable. However, none 
of the subsequent apartments that you visit measure up to the first. Sadly, your offer 
for that apartment is rejected because, meanwhile, someone else rented it. You will 
have to settle for a far less desirable apartment because you hesitated.

2. You are looking for a partner to spend the rest of your life with. To that end you 
contract with a singles dating agency to meet 50 possible life partners at the rate of 
one date per week. On your ninth date, you decide that you have found your life’s 
partner and offer marriage, which is accepted. However, you forget to tell the dating 
agency to stop introducing you to additional partners. The following week you are 
introduced to a date that in all qualities surpasses your chosen one. You kick yourself 
for having acted too impulsively. 

3. You are interviewing candidates for a high-level position in the government. To reduce 
the possibility of discrimination on your part you are bound by the following rules: 
You are to interview the candidates in sequence and offer the job to the first candidate 
who is qualified according to the job description. If you reject a candidate it means 
that he/she was not qualified and so you must state in writing in your report. However, 
you are savvy enough to know that even among the qualified candidates there will be 
those that are superbly qualified while others will be merely qualified. You want to 
hire the best person for the job. What should your strategy be? 


Total Probability. In many problems in engineering and science we would like to compute
the unconditional probability $P[B]$ of an event $B$ in terms of the sum of weighted
conditional probabilities. Such a computation is easily realized through the following
theorem.

Theorem 1.6-1 Let $A_1, A_2, \ldots, A_n$ be $n$ mutually exclusive events such that
$\bigcup_{i=1}^{n} A_i = \Omega$ (the $A_i$’s are exhaustive). Let $B$ be any event defined over the probability
space of the $A_i$’s. Then, with $P[A_i] \neq 0$ for all $i$,

$P[B] = P[B|A_1]\,P[A_1] + \ldots + P[B|A_n]\,P[A_n].$ (1.6-7)

Sometimes $P[B]$ is called the total probability of $B$ because the expression on the right is a
weighted average of the conditional probabilities of $B$.

Proof We have $A_iA_j = \phi$ for all $i \neq j$ and $\bigcup_{i=1}^{n} A_i = \Omega$. Also $B\Omega = B = B \bigcup_{i=1}^{n} A_i =
\bigcup_{i=1}^{n} BA_i$. But by definition of the intersection operation, $BA_i \subset A_i$; hence $(BA_i)(BA_j) = \phi$
for all $i \neq j$. Thus, from Axiom 3 (generalized to $n$ events):

$P[B] = P\left[\bigcup_{i=1}^{n} BA_i\right] = P[BA_1] + P[BA_2] + \ldots + P[BA_n]$
$= P[B|A_1]\,P[A_1] + \ldots + P[B|A_n]\,P[A_n].$ (1.6-8)

The last line follows from Equation 1.6-2.


Example 1.6-13 
(more on binary channel) For the binary communication system shown in Figure 1.6-4, 
compute the unconditional output probabilities P[Yo] and P[Yi]. 


Solution Continuing with the notation of binary communication Example 1.6-10, we use
Equation 1.6-8 as follows:

$P[Y_0] = P[Y_0|X_0]\,P[X_0] + P[Y_0|X_1]\,P[X_1]$
$= P_C[0|0]\,P_S[0] + P_C[0|1]\,P_S[1]$†
$= (0.9)(0.5) + (0.1)(0.5)$
$= 0.5.$

We can compute $P[Y_1]$ in a similar fashion or by noting that $Y_0 \cup Y_1 = \Omega$ and $Y_0 \cap Y_1 = \phi$;
that is, they are disjoint. Hence $P[Y_0] + P[Y_1] = 1$, implying $P[Y_1] = 1 - P[Y_0] = 0.5$.
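Equation 1.6-8 is a weighted sum and is easy to mechanize. The following Python sketch, with the hypothetical helper name `total_probability`, reproduces the value of $P[Y_0]$ for the binary channel; it assumes the conditioning events are listed in the same order in both arguments.

```python
def total_probability(cond, prior):
    """Return P[B] = sum_i P[B|A_i] P[A_i] for disjoint, exhaustive events A_i.

    cond[i]  holds P[B|A_i]; prior[i] holds P[A_i].
    """
    if len(cond) != len(prior):
        raise ValueError("cond and prior must have the same length")
    if abs(sum(prior) - 1.0) > 1e-9:
        raise ValueError("the priors of exhaustive, disjoint events must sum to one")
    return sum(c * p for c, p in zip(cond, prior))


# Binary channel of Example 1.6-10: P[Y_0|X_0] = 0.9, P[Y_0|X_1] = 0.1.
p_y0 = total_probability(cond=[0.9, 0.1], prior=[0.5, 0.5])
p_y1 = 1.0 - p_y0
print(p_y0, p_y1)
```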


1.7 BAYES’ THEOREM AND APPLICATIONS 


The previous results enable us now to write a fairly simple formula known as Bayes’ 
theorem. Despite its simplicity, this formula is widely used in biometrics, epidemiology, 
and communication theory. 


†For notational ease, we have abbreviated these terms by leaving off the curly brackets. We retain the
square brackets for probabilities $P$ throughout as a reminder that they are set functions.
Named after Thomas Bayes, English mathematician/philosopher (1702-1761).

Bayes’ Theorem Let $A_i$, $i = 1, \ldots, n$, be a set of disjoint and exhaustive events
defined on a probability space $\mathscr{P}$. Then, $\bigcup_{i=1}^{n} A_i = \Omega$, $A_iA_j = \phi$ for all $i \neq j$. With $B$ any
event defined on $\mathscr{P}$ with $P[B] > 0$ and $P[A_i] \neq 0$ for all $i$,

$P[A_j|B] = \frac{P[B|A_j]\,P[A_j]}{\sum_{i=1}^{n} P[B|A_i]\,P[A_i]}.$ (1.7-1)


Proof The denominator is simply $P[B]$ by Equation 1.6-8 and the numerator is simply
$P[A_jB]$. Thus, Bayes’ theorem is merely an application of the definition of conditional
probability.


Remark In practice the terms in Equation 1.7-1 are given various names: $P[A_j|B]$
is known as the a posteriori probability of $A_j$ given $B$; $P[B|A_j]$ is called the a priori
probability of $B$ given $A_j$; and $P[A_j]$ is the causal or a priori probability of $A_j$. In general
a priori probabilities are estimated from past measurements or presupposed by experience
while a posteriori probabilities are measured or computed from observations.


Example 1.7-1 
(inverse binary channel) In a communication system a zero or one is transmitted with
$P_S[0] = p_0$, $P_S[1] = 1 - p_0 \triangleq p_1$, respectively. Due to noise in the channel, a zero can be
received as a one with probability $\beta$, called the cross-over probability, and a one can be
received as a zero also with probability $\beta$. A one is observed at the output of the channel.
What is the probability that a one was output by the source and input to the channel, that
is, transmitted?


Solution The structure of the channel is shown in Figure 1.7-1. We write

$P[X_1|Y_1] = \frac{P[X_1Y_1]}{P[Y_1]}$ (1.7-2)
$= \frac{P_C[1|1]\,P_S[1]}{P_C[1|1]\,P_S[1] + P_C[1|0]\,P_S[0]}$ (1.7-3)
$= \frac{p_1(1 - \beta)}{p_1(1 - \beta) + p_0\beta}.$ (1.7-4)

Figure 1.7-1 Representation of a binary communication channel subject to noise.

If $p_0 = p_1 = \frac{1}{2}$, the inverse or a posteriori probability $P[X_1|Y_1]$ depends on $\beta$ as shown in
Figure 1.7-2. The channel is said to be noiseless if $\beta = 0$, but notice that the channel is just
as useful when $\beta = 1$. Just invert the outputs in this case!
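To see how the a posteriori probability of Equation 1.7-4 varies with the cross-over probability, one can tabulate it directly. The Python sketch below does so; the function name `p_x1_given_y1` and the sample values of $\beta$ are our own choices.

```python
def p_x1_given_y1(beta, p1=0.5):
    """A posteriori probability P[X_1|Y_1] of Eq. 1.7-4 for cross-over probability beta."""
    p0 = 1.0 - p1
    p_y1 = p1 * (1.0 - beta) + p0 * beta   # total probability of observing a one
    if p_y1 == 0.0:
        raise ValueError("P[Y_1] = 0: a one is never observed for these parameters")
    return p1 * (1.0 - beta) / p_y1


for beta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"beta = {beta:.1f}   P[X_1|Y_1] = {p_x1_given_y1(beta):.3f}")
```

With equiprobable inputs the expression reduces to $1 - \beta$, so a received one is certainly a transmitted one at $\beta = 0$ and certainly a transmitted zero at $\beta = 1$.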


Example 1.7-2 
(amyloid test for Alzheimer’s disease) On August 10, 2010 there was a story on network 
television news that a promising new test was developed for Alzheimer’s disease. It was 
based on the occurrence of the protein amyloid in the spinal (and cerebral) fluid, which 
could be detected via a spinal tap. It was reported that among Alzheimer’s patients (65 
and older) there were 90 percent who had amyloid protein, while among the Alzheimer’s 
free group (65 and older) amyloid was present in only 36 percent of this subpopulation. 
Now the general incidence of Alzheimer’s among the group 65 and older is thought to be 
10 percent from various surveys over the years. From this data, we want to find out: Is it 
really a good test? 

First we construct the probability space for this experiment. We set $\Omega = \{00, 01, 10, 11\}$
with four outcomes:

00 = “no amyloid” and “no Alzheimer’s,”
01 = “no amyloid” and “Alzheimer’s,”
10 = “amyloid” and “no Alzheimer’s,”
11 = “amyloid” and “Alzheimer’s.”


On this sample space, we define two events: $A \triangleq \{10, 11\}$ = “amyloid” and
$B \triangleq \{01, 11\}$ = “Alzheimer’s.” From the data above we have

$P[A|B] = 0.9$ and $P[A|B^c] = 0.36$.

Figure 1.7-2 A posteriori probability $P[X_1|Y_1]$ versus $\beta$.

Also from the general population (65 and greater), we know

$P[B] = 0.1$ and $P[B^c] = 1 - P[B] = 0.9$.


Now to determine if the test is good, we must look at the situation after we give the test,
and this is modeled by the conditional probabilities after the test. They are either $P[\cdot|A]$,
if the test is positive for amyloid, or $P[\cdot|A^c]$, if the test is negative. So, we can use Bayes’
theorem, with total probability in the denominator, to find answers such as $P[B|A]$. We have

$P[B|A] = \frac{P[A|B]\,P[B]}{P[A|B]\,P[B] + P[A|B^c]\,P[B^c]} = \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.36 \times 0.9} = 0.217.$


So, among the group that tests positive, only about 22 percent will actually have Alzheimer’s.
The test does not seem so promising now. Why is this? Well, the problem is that we are never
in the “knowledge state” characterized by event $B$ where conditional probability $P[\cdot|B]$ is
relevant. Before the test is given, our knowledge state is characterized by the unconditional
probabilistic knowledge $P[\cdot]$. After the test, we have knowledge state determined by
whether event $A$ or $A^c$ has occurred; that is, our conditional probabilistic state is either
$P[\cdot|A]$ or $P[\cdot|A^c]$. You see, we enter into states of knowledge either “given $A$” or “given $A^c$”
by testing the population. So we are never in the situation or knowledge state where $P[\cdot|B]$
or $P[\cdot|B^c]$ is the relevant probability measure. So the given information $P[A|B] = 0.9$ and
$P[A|B^c] = 0.36$ is not helpful to directly decide whether the test is useful or not. This is
the logical fallacy of reasoning with $P[A|B]$ instead of $P[B|A]$, but there is another very
practical thing going on here too in this particular example.

When we calculate $P[B^c|A] = 1.0 - 0.217 = 0.783$, this means that about 78 percent of
those with positive amyloid tests do not have Alzheimer’s. So the test is not useful due to 
its high false-positive rate. Again, as in the previous example, the scarcity of Alzheimer’s 
in the general population (65 and greater) is a problem here, and any test will have to 
overcome this in order to become a useful test. 
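The amyloid calculation is a direct instance of Equation 1.7-1 and can be checked in a few lines of Python. The helper name `posterior` is hypothetical; index 0 stands for $B$ (Alzheimer’s) and index 1 for $B^c$.

```python
def posterior(likelihood, prior, j):
    """Bayes' theorem (Eq. 1.7-1): P[A_j|B] from the P[B|A_i] and the P[A_i]."""
    evidence = sum(l * p for l, p in zip(likelihood, prior))  # P[B] by total probability
    return likelihood[j] * prior[j] / evidence


# likelihood = (P[A|B], P[A|B^c]); prior = (P[B], P[B^c]) from Example 1.7-2.
p_alz_given_amyloid = posterior(likelihood=[0.9, 0.36], prior=[0.1, 0.9], j=0)
p_healthy_given_amyloid = posterior(likelihood=[0.9, 0.36], prior=[0.1, 0.9], j=1)

print(round(p_alz_given_amyloid, 3))      # 0.217
print(round(p_healthy_given_amyloid, 3))  # 0.783
```

Lowering the second likelihood, the false-positive rate $P[A|B^c]$, is what raises the posterior here; the low prior $P[B] = 0.1$ is what keeps it small.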


1.8 COMBINATORICS†


Before proceeding with our study of basic probability, we introduce a number of counting 
formulas important for counting equiprobable events. Some of the results presented here 
will have immediate application in Section 1.9; others will be useful later. 


†This material closely follows that of William Feller [1-8].

A population of size $n$ will be taken to mean a collection (set) of $n$ elements without
regard to order. Two populations are considered different if one contains at least one element
not contained in the other. A subpopulation of size $r$ from a population of size $n$ is a subset of
$r$ elements taken from the original population. Likewise, two subpopulations are considered
different if one has at least one element different from the other.

Consider a population of $n$ elements $a_1, a_2, \ldots, a_n$. Any ordered arrangement
$a_{k_1}, a_{k_2}, \ldots, a_{k_r}$ of $r$ symbols is called an ordered sample of size $r$. Consider now the generic urn
containing $n$ distinguishable numbered balls. Balls are removed one by one. How many
different ordered samples of size $r$ can be formed? There are two cases:

(i) Sampling with replacement. Here after each ball is removed, its number is recorded
and it is returned to the urn. Thus, for the first sample there are $n$ choices, for the second
there are again $n$ choices, and so on. Thus, we are led to the following result: For a population
of $n$ elements, there are $n^r$ different ordered samples of size $r$ that can be formed with
replacement.

(ii) Sampling without replacement. After each ball is removed, it is not available
anymore for subsequent samples. Thus, $n$ balls are available for the first sample, $n - 1$
for the second, and so forth. Thus, we are now led to the result: For a population of $n$
elements, there are

$(n)_r \triangleq n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!}$ (1.8-1)

different ordered samples of size $r$ that can be formed without replacement.†

The Number of Subpopulations of Size r in a Population of Size n. A basic problem
that often occurs in probability is the following: How many groups, that is, subpopulations,
of size $r$ can be formed from a population of size $n$? For example, consider six balls numbered
1 to 6. How many groups of size 2 can be formed? The following table shows that there are
15 groups of size 2 that can be formed:

12 23 34 45 56
13 24 35 46
14 25 36
15 26
16


Note that this is different from the number of ordered samples that can be formed
without replacement. These are ($6 \cdot 5 = 30$):

12 21 31 41 51 61
13 23 32 42 52 62
14 24 34 43 53 63
15 25 35 45 54 64
16 26 36 46 56 65

†Different samples will often contain the same subpopulation but with a different ordering. For this
reason we sometimes speak of $(n)_r$ ordered samples that can be formed without replacement.

Also it is different from the number of samples that can be formed with replacement
($6^2 = 36$):

11 21 31 41 51 61
12 22 32 42 52 62
13 23 33 43 53 63
14 24 34 44 54 64
15 25 35 45 55 65
16 26 36 46 56 66
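The three tables for six balls can be regenerated by enumeration. The Python sketch below uses the standard `itertools` and `math` modules (`math.comb` and `math.perm` require Python 3.8 or later); the variable names are our own.

```python
from itertools import combinations, permutations, product
from math import comb, perm

balls = range(1, 7)   # six balls numbered 1 to 6

groups = list(combinations(balls, 2))           # subpopulations of size 2 (unordered)
ordered = list(permutations(balls, 2))          # ordered samples without replacement
with_repl = list(product(balls, repeat=2))      # ordered samples with replacement

print(len(groups), comb(6, 2))      # number of groups of size 2
print(len(ordered), perm(6, 2))     # (n)_r = n! / (n - r)!
print(len(with_repl), 6 ** 2)       # n^r
```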


A general formula for the number of subpopulations, $C_r^n$, of size $r$ in a population of
size $n$ can be computed as follows: Consider an urn with $n$ distinguishable balls. We already
know that the number of ordered samples of size $r$ that can be formed is $(n)_r$. Now consider
a specific subpopulation of size $r$. For this subpopulation there are $r!$ arrangements and
therefore $r!$ different ordered samples. Thus, for $C_r^n$ subpopulations there must be $C_r^n \cdot r!$
different ordered samples of size $r$. Hence

$C_r^n \cdot r! = (n)_r$

or

$C_r^n = \frac{(n)_r}{r!} = \frac{n!}{(n-r)!\,r!} \triangleq \binom{n}{r}.$ (1.8-2)

Equation 1.8-2 is an important result, and we shall apply it in the next section. The symbol

$C_r^n \triangleq \binom{n}{r}$

is called a binomial coefficient. Clearly

$\binom{n}{r} = \frac{n!}{r!\,(n-r)!} = \frac{n!}{(n-r)!\,r!} = \binom{n}{n-r} = C_{n-r}^n.$

We already know from Section 1.4 that the total number of subsets of a set of size $n$ is $2^n$.
The number of subsets of size $r$ is $\binom{n}{r}$. Hence we obtain that

$\sum_{r=0}^{n} \binom{n}{r} = 2^n.$


A result which can be viewed as an extension of the binomial coefficient $C_r^n$ is given by the
following.


Theorem 1.8-1 Let $r_1, \ldots, r_l$ be a set of $l$ nonnegative integers such that $r_1 + r_2 + \ldots
+ r_l = n$. Then the number of ways in which a population of $n$ elements can be partitioned
into $l$ subpopulations of which the first contains $r_1$ elements, the second $r_2$ elements, and
so forth, is

$\frac{n!}{r_1!\,r_2! \cdots r_l!}.$ (1.8-4)

This coefficient is called the multinomial coefficient. Note that the order of the subpopulation
is essential in the sense that ($r_1 = 7$, $r_2 = 10$) and ($r_1 = 10$, $r_2 = 7$) represent different
partitions. However, the order within each group does not receive attention. For example,
suppose we have five distinguishable balls (1,2,3,4,5) and we ask how many subpopulations
can be made with three balls in the first group and two in the second. Here $n = 5$, $r_1 = 3$,
$r_2 = 2$, and $r_1 + r_2 = 5$. The answer is $5!/(3!\,2!) = 10$ and the partitions are

Group 1: | 1,2,3 | 2,3,4 | 3,4,5 | 4,5,1 | 5,1,2 | 2,4,5 | 2,3,5 | 1,3,5 | 1,3,4 | 1,2,4
Group 2: | 4,5   | 5,1   | 1,2   | 2,3   | 3,4   | 1,3   | 1,4   | 2,4   | 2,5   | 3,5

Note that the order is important in that had we set $r_1 = 2$ and $r_2 = 3$ we would have gotten
a different partition, for example,

Group 1: | 4,5   | 5,1   | 1,2   | 2,3   | 3,4   | 1,3   | 1,4   | 2,4   | 2,5   | 3,5
Group 2: | 1,2,3 | 2,3,4 | 3,4,5 | 4,5,1 | 5,1,2 | 2,4,5 | 2,3,5 | 1,3,5 | 1,3,4 | 1,2,4

The partition (4,5), (1,2,3) is, however, identical with (5,4), (2,1,3).
Proof Note that we can rewrite Equation 1.8-4 as

$\frac{n!}{r_1! \cdots r_l!} = \frac{n!}{r_1!\,(n-r_1)!} \cdot \frac{(n-r_1)!}{r_2!\,(n-r_1-r_2)!} \cdot \frac{(n-r_1-r_2)!}{r_3!\,(n-r_1-r_2-r_3)!} \cdots \frac{\left(n - \sum_{j=1}^{l-1} r_j\right)!}{r_l!\,\left(n - \sum_{j=1}^{l} r_j\right)!}.$

Recalling that $0! = 1$, we see that the last term is unity. Then the multinomial formula is
written as

$\frac{n!}{r_1!\,r_2! \cdots r_l!} = \binom{n}{r_1} \binom{n-r_1}{r_2} \binom{n-r_1-r_2}{r_3} \cdots \binom{n-r_1-r_2-\cdots-r_{l-2}}{r_{l-1}}.$ (1.8-5)

To effect a realization of $r_1$ elements in the first subpopulation, $r_2$ in the second, and
so on, we would select $r_1$ elements from the given $n$, $r_2$ from the remaining $n - r_1$, $r_3$
from the remaining $n - r_1 - r_2$, etc. But there are $\binom{n}{r_1}$ ways of choosing $r_1$ elements
out of $n$, $\binom{n-r_1}{r_2}$ ways of choosing $r_2$ elements out of the remaining $n - r_1$, and so
forth. Thus, the total number of ways of choosing $r_1$ from $n$, $r_2$ from $n - r_1$, and so on is
simply the product of the factors on the right-hand side of Equation 1.8-5 and the proof is
complete.


Example 1.8-1 
(toss 12 dice) [1-8, p. 36] Suppose we throw 12 dice; since each die throw has six outcomes
there are a total of $n_T = 6^{12}$ outcomes. Consider now the event $E$ that each face appears
twice. There are, of course, many ways in which this can happen. Two outcomes in which
this happens are shown below:

Dice I.D. Number | 1  2  3  4  5  6  7  8  9  10 11 12
Outcome 1        | 3  1  3  6  1  2  5  4  4  6  2  5
Outcome 2        | 6  1  3  2  6  3  4  5  5  1  4  2

The total number of ways that this event can occur is the number of ways 12 dice ($n = 12$)
can be arranged into six groups ($l = 6$) of two each ($r_1 = r_2 = \ldots = r_6 = 2$). Assuming
that all outcomes are equally likely we compute

$P[E] = \frac{n_E}{n_T} = \frac{\text{number of ways } E \text{ can occur}}{\text{total number of outcomes}} = \frac{12!}{(2!)^6\,6^{12}} = 0.003438.$
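Both forms of the multinomial coefficient, Equation 1.8-4 and the product of binomial coefficients in Equation 1.8-5, can be evaluated and compared numerically. The Python sketch below does this and then recomputes the dice probability exactly; the function names are our own, and `math.prod` and `math.comb` require Python 3.8 or later.

```python
from fractions import Fraction
from math import comb, factorial, prod


def multinomial(*r):
    """n! / (r_1! r_2! ... r_l!) with n = r_1 + ... + r_l (Eq. 1.8-4)."""
    return factorial(sum(r)) // prod(factorial(x) for x in r)


def multinomial_via_binomials(*r):
    """Product of binomial coefficients on the right-hand side of Eq. 1.8-5."""
    remaining, count = sum(r), 1
    for x in r[:-1]:              # the last group is forced, contributing a factor of one
        count *= comb(remaining, x)
        remaining -= x
    return count


print(multinomial(3, 2), multinomial_via_binomials(3, 2))   # five balls, groups of 3 and 2

# Example 1.8-1: each face appears twice in a throw of 12 dice.
n_E = multinomial(2, 2, 2, 2, 2, 2)
p_E = Fraction(n_E, 6 ** 12)
print(n_E, float(p_E))
```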


The binomial and multinomial coefficients appear in the binomial and multinomial 
probability laws discussed in the next sections. The multinomial coefficient is also important 
in a class of problems called occupancy problems that occur in theoretical physics. 


Occupancy Problems* 


Occupancy problems are generically modeled as the random placement of $r$ balls into $n$
cells. For the first ball there are $n$ choices, for the second ball there are $n$ choices, and so
on, so that there are $n^r$ possible distributions of $r$ balls in $n$ cells and each has a probability
of $n^{-r}$. If the balls are distinguishable, then each of the distributions is distinguishable; if
the balls are not distinguishable, then there are fewer than $n^r$ distinguishable distributions.
For example, with three distinguishable balls ($r = 3$) labeled “1,” “2,” “3” and two cells
($n = 2$), we get eight ($2^3$) distinguishable distributions:

Cell no. 1 | 1   | 2   | 3   | 1,2 | 1,3 | 2,3 | 1,2,3 | —
Cell no. 2 | 2,3 | 1,3 | 1,2 | 3   | 2   | 1   | —     | 1,2,3


When the balls are not distinguishable (each ball is represented by a “*”), we obtain
four distinct distributions:

Cell no. 1 | *** | ** | *  | —
Cell no. 2 | —   | *  | ** | ***

How many distinguishable distributions can be formed from $r$ balls and $n$ cells? An
elegant way to compute this is furnished by William Feller [1-8, p. 38] using a clever artifice.
This artifice consists of representing the $n$ cells by the spaces between $n + 1$ bars and the
balls by stars. Thus,

| | | |

represents three empty cells, while

|**| | |*|**| |*****|

represents two balls in the first cell, zero balls in the second and third cells, one in the
fourth, two in the fifth, and so on. Indeed, with $r_i \geq 0$ representing the number of balls in
the $i$th cell and $r$ being the total number of balls, it follows that

$r_1 + r_2 + \ldots + r_n = r.$

The $n$-tuple $(r_1, r_2, \ldots, r_n)$ is called the occupancy and the $r_i$ are the occupancy numbers;
two distributions of balls in cells are said to be indistinguishable if their corresponding
occupancies are identical. The occupancy of

|**| | |*|**| |*****|

is (2,0,0,1,2,0,5). Note that $n$ cells require $n + 1$ bars but since the first and last symbols
must be bars, only $n - 1$ bars and $r$ stars can appear in any order. Thus, we are asking for
the number of subpopulations of size $r$ in a population of size $n - 1 + r$. The result is, by
Equation 1.8-2,

$\binom{n+r-1}{r} = \binom{n+r-1}{n-1}.$ (1.8-6)
Example 1.8-2 


(distinguishable distributions) Show that the number of distinguishable distributions in
which no cell remains empty is $\binom{r-1}{n-1}$. Here we require that no bars be adjacent. Therefore,
$n$ of the $r$ stars must occupy spaces between the bars but the remaining $r - n$ stars
can go anywhere. Thus, $n - 1$ bars and $r - n$ stars can appear in any order. The number of
distinct distributions is then equal to the number of ways of choosing $r - n$ places in $(n - 1)$
bars $+\,(r - n)$ stars or $r - n$ out of $n - 1 + r - n = r - 1$. This is, by Equation 1.8-2,

$\binom{r-1}{r-n} = \binom{r-1}{n-1}.$
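Both occupancy counts can be verified by brute-force enumeration for small $r$ and $n$. In the Python sketch below, the helper `occupancies` and the test values $r = 5$, $n = 3$ are our own illustrative choices.

```python
from itertools import product
from math import comb


def occupancies(r, n):
    """All n-tuples of nonnegative integers summing to r (brute-force enumeration)."""
    return [t for t in product(range(r + 1), repeat=n) if sum(t) == r]


r, n = 5, 3                                    # illustrative values: 5 balls, 3 cells
all_occ = occupancies(r, n)
no_empty = [t for t in all_occ if min(t) > 0]  # every cell holds at least one ball

print(len(all_occ), comb(n + r - 1, r))        # Eq. 1.8-6
print(len(no_empty), comb(r - 1, n - 1))       # Example 1.8-2
```

The enumeration grows as $(r+1)^n$, so it is useful only as a check on small cases; the closed forms are what one would use in practice.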
Example 1.8-3 


(birthdays on same date) Small groups of people are amazed to find that their birthdays 
often coincide with others in the group. Before declaring this a mystery of fate, we analyze 
this situation as an occupancy problem. We want to compute how large a group is needed 
to have a high probability of a birthday collision, that is, at least two people in the group 
having their birthdays on the same date. 
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Solution We let the n (n = 365) days of the year be represented by n cells, and the r 
people in the group be represented by r balls. Then when a ball is placed into a cell, it fixes 
the birthday of the person represented by that ball. A birthday collision occurs when two 
or more balls are in the same cell. Now consider the arrangements of the balls. The first 
ball can go into any of the n cells, but the second ball has only n — 1 cells to choose from 
to avoid a collision. Likewise, the third ball has only n — 2 cells to choose from if a collision 
is to be avoided. Continuing in this fashion, we find that the number of arrangements that 
avoid a collision is n(n — 1)---(m—r-+1). The total number of arrangements of r balls in 
n cells is n™. Hence with Po(r,n) denoting the probability of zero birthday collisions as a 


r—1 


function of r and n, we find that Po(r,n) = M@-V=@="*) _ TT (1 +). Then 1— Po(r,n) 


nr =I 
is the probability of at least one collision. 
How large does $r$ need to be so that $1 - P_0(r,n) \ge 0.9$ or, equivalently, $P_0(r,n) \le 0.1$?
Except for the mechanics of solving for $r$ in $\prod_{i=1}^{r-1}(1 - i/n) \le 0.1$, the problem is
over. We use a result from elementary calculus that for real $x$, $1 - x \le e^{-x}$, which is quite
a good approximation for $x$ near 0. If we replace $\prod_{i=1}^{r-1}(1 - i/n) \le 0.1$ by
$\prod_{i=1}^{r-1} e^{-i/n} \le 0.1$, we get a bound and estimate of $r$. Since

$$\prod_{i=1}^{r-1} e^{-i/n} = \exp\left\{-\frac{1}{n}\sum_{i=1}^{r-1} i\right\}$$

and with use of $\sum_{i=1}^{r-1} i = r(r-1)/2$, it follows that $e^{-r(r-1)/(2n)} \le 0.1$ will give us
an estimate of $r$. Solving for $r$ and assuming that $r^2 \gg r$, and $n = 365$, we get that
$r \approx 40$. So having 40 people in a group will yield about a 90 percent probability of (at
least) two people having their birthdays on the same day.
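The estimate above can be checked numerically. The following is a minimal sketch, not part of the original solution; the function names `prob_no_collision`, `bound_no_collision`, and `smallest_group` are illustrative assumptions. It evaluates the exact product for $P_0(r,n)$ and the exponential bound, and finds the smallest group size at which each falls below 0.1.

```python
import math

def prob_no_collision(r, n=365):
    """Exact P_0(r, n): product of (1 - i/n) for i = 1, ..., r-1."""
    p = 1.0
    for i in range(1, r):
        p *= 1.0 - i / n
    return p

def bound_no_collision(r, n=365):
    """Upper bound exp(-r(r-1)/(2n)) that follows from 1 - x <= exp(-x)."""
    return math.exp(-r * (r - 1) / (2.0 * n))

def smallest_group(threshold, prob_fn, n=365):
    """Smallest r for which prob_fn(r, n) drops below the threshold."""
    r = 1
    while prob_fn(r, n) >= threshold:
        r += 1
    return r

r_exact = smallest_group(0.1, prob_no_collision)
r_bound = smallest_group(0.1, bound_no_collision)
print(r_exact, r_bound)
```

Both values land in the low forties, close to the rough estimate $r \approx 40$ obtained by dropping the term linear in $r$.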


Example 1.8-4 
(treize) In seventeenth-century Venice, during the holiday of Carnivale, gamblers wearing 
the masks of the characters in the commedia dell’arte played the card game treize in enter- 
tainment houses called ridottos. 

In treize, one player acts as the bank and the other players place their bets. The bank 
shuffles the deck, cards face down, and then calls out the names of the cards in order, from 
1 to 13—ace to king—as he turns over one card at a time. If the card that is turned over 
matches the number he calls, then he (the bank) wins and collects all the bets. If the card 
that is turned over does not match the bank’s call, the game continues until the dealer calls 
“thirteen.” If the thirteenth card turned over is not a king, the bank loses the game and 
must pay each of the bettors an amount equal to their bet; in that case the player acting 
as bank must relinquish his position as bank to the player on the right. 

What is the probability that the bank wins? 


Solution We simplify the analysis by assuming that once a card is turned over, and there
is no match, it is put back into the deck and the deck is reshuffled before the next card is
dealt, that is, turned over. Let $A_n$ denote the event that the bank has a first match, that
is, a win, on the $n$th deal and $W_n$ denote the event of a win in $n$ tries. Since there are 4
cards of each number in a deck of 52 cards, the probability of a match is 1/13. In order for
a first win on the $n$th deal there have to be $n - 1$ non matches followed by a match. The
probability of this event is

$$P[A_n] = \left(1 - \frac{1}{13}\right)^{n-1}\frac{1}{13}.$$

Since $A_iA_j = \phi$ for $i \ne j$, the probability of a win in 13 tries is

$$P[W_{13}] = \sum_{n=1}^{13} P[A_n] = \frac{1}{13}\sum_{n=0}^{12}\left(\frac{12}{13}\right)^{n}
= \frac{1}{13}\cdot\frac{1 - (12/13)^{13}}{1 - 12/13} = 0.647,$$

from which it follows that the probability of the event $W_{13}^c$ that the bank loses is
$P[W_{13}^c] = 0.353$. Actually this result could have been more easily obtained by observing
that the bank loses if it fails to get a match (no successes) in 13 tries, with success
probability 1/13. Hence

$$P[W_{13}^c] = \binom{13}{0}\left(\frac{1}{13}\right)^{0}\left(\frac{12}{13}\right)^{13} = 0.353.$$

Note that in the second equation we used the sum of the geometric series result:

$$\sum_{n=0}^{N-1} x^n = \frac{1 - x^N}{1 - x} \quad \text{(cf. Appendix A)}.$$

Points to consider. Why does $P[A_n] \to 0$ as $n \to \infty$? Why is $P[W_n] \ge P[A_n]$ for all
$n$? How would you remodel this problem if we didn't make the assumption that the dealt
card was put back into the deck? How would the problem change if the bank called down
from 13 (king) to 1 (ace)?
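The two routes to the answer in Example 1.8-4 (summing the first-match probabilities, and taking the complement of thirteen straight misses) can be compared with exact rational arithmetic. This is a small sketch under the same with-replacement assumption as the solution; the names `first_match_on`, `win_by_sum`, and `win_by_complement` are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 13)   # probability of a match on any single deal
q = 1 - p             # probability of no match on a single deal

def first_match_on(n):
    """P[A_n]: n - 1 misses followed by a match on the nth deal."""
    return q ** (n - 1) * p

# Route 1: the events A_1, ..., A_13 are disjoint, so their probabilities add.
win_by_sum = sum(first_match_on(n) for n in range(1, 14))

# Route 2: the bank loses only if all 13 deals miss.
win_by_complement = 1 - q ** 13

print(float(win_by_sum), float(win_by_complement))
```

Using `Fraction` keeps the comparison exact, so the agreement of the two routes does not depend on floating-point rounding.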


In statistical mechanics, a six-dimensional space called phase space is defined as a
space which consists of three position and three momentum coordinates. Because of the
uncertainty principle which states that the uncertainty in position times the uncertainty
in momentum cannot be less than Planck's constant $h$, phase space is quantized into tiny
cells of volume $v = h^3$. In a system that contains atomic or molecular size particles,
the distribution of these particles among the cells constitutes the state of the system. In
Maxwell-Boltzmann statistics, all distributions of $r$ particles among $n$ cells are equally likely.
It can be shown (see, for example, Concepts of Modern Physics by A. Beiser, McGraw-Hill,
1973) that this leads to the famous Boltzmann law

$$n(\varepsilon)\,d\varepsilon = \frac{2\pi N}{(\pi kT)^{3/2}}\sqrt{\varepsilon}\;e^{-\varepsilon/kT}\,d\varepsilon,$$ (1.8-7)

where $n(\varepsilon)\,d\varepsilon$ is the number of particles with energy between $\varepsilon$ and
$\varepsilon + d\varepsilon$, $N$ is the total number of particles, $T$ is absolute temperature, and $k$ is
the Boltzmann constant. The Maxwell-Boltzmann law holds for identical particles that, in some
sense, can be distinguished. It is argued that the molecules of a gas are particles of this kind.
It is not difficult to show that Equation 1.8-7 integrates to $N$.

In contrast to the Maxwell-Boltzmann statistics, where all $n^r$ arrangements are equally
likely, Bose-Einstein statistics considers only distinguishable arrangements of indistinguishable
identical particles. For $n$ cells and $r$ particles, the number of such arrangements is given
by Equation 1.8-6,

$$\binom{n+r-1}{r},$$

and each arrangement is assigned a probability

$$\binom{n+r-1}{r}^{-1}.$$


It is argued that Bose-Einstein statistics are valid for photons, nuclei, and particles of zero 
or integral spin that do not obey the exclusion principle. The exclusion principle, discovered 
by Wolfgang Pauli in 1925, states that for a certain class of particles (e.g., electrons) no two 
particles can exist in the same quantum states (e.g., no two or more balls in the same cell). 

To deal with particles that obey the exclusion principle, a third assignment of proba- 
bilities is construed. This assignment, called Fermi—Dirac statistics, assumes 


(1) the exclusion principle (no two or more balls in the same cell); and 
(2) all distinguishable arrangements satisfying (1) are equally probable. 


Note that for Fermi-Dirac statistics, $r \le n$. The number of distinguishable arrangements
under the hypothesis of the exclusion principle is the number of subpopulations of size $r \le n$
in a population of $n$ elements, or $\binom{n}{r}$. Since each is equally likely, the probability of any
one state is $\binom{n}{r}^{-1}$.

The above discussions should convince the reader of the tremendous importance of 


probability in the basic sciences as well as its limitations: No amount of pure reasoning based 
on probability axioms could have determined which particles obey which probability laws. 
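The three counting rules can be placed side by side for a small system. The sketch below is illustrative only; the function names are assumptions introduced here, and the counts follow directly from $n^r$, Equation 1.8-6, and Equation 1.8-2.

```python
from math import comb

def maxwell_boltzmann_count(n, r):
    """All placements of r distinguishable particles in n cells."""
    return n ** r

def bose_einstein_count(n, r):
    """Distinguishable arrangements of r indistinguishable particles in n cells."""
    return comb(n + r - 1, r)

def fermi_dirac_count(n, r):
    """Arrangements with at most one particle per cell (zero when r > n)."""
    return comb(n, r)

n_cells, r_particles = 3, 2
for name, count in [
    ("Maxwell-Boltzmann", maxwell_boltzmann_count(n_cells, r_particles)),
    ("Bose-Einstein", bose_einstein_count(n_cells, r_particles)),
    ("Fermi-Dirac", fermi_dirac_count(n_cells, r_particles)),
]:
    # Under each rule every counted arrangement has probability 1 / count.
    print(name, count, 1 / count)
```

With three cells and two particles the same physical picture yields 9, 6, or 3 equally likely states depending on which rule is adopted, which is why the choice cannot be settled by the axioms alone.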


Extensions and Applications 


Theorem 1.5-1 on the probability of a union of events can be used to solve problems of
engineering interest. First we note that the number of individual probability terms in the
sum $S_i$ is $\binom{n}{i}$. Why? There are a total of $n$ indices and in $S_i$, all terms have $i$ indices.
For example, with $n = 5$ and $i = 2$, $S_2$ will consist of the sum of the terms $P_{ij}$, where the
indices $ij$ are 12; 13; 14; 15; 23; 24; 25; 34; 35; 45. Each set of indices in $S_i$ never repeats,
that is, they are all different. Thus, the number of index sets and, therefore, the number of
terms in $S_i$ is the number of subpopulations of size $i$ in a population of size $n$, which is
$\binom{n}{i}$ from Equation 1.8-2. Note that $S_n$ will have only a single term.


Example 1.8-5 
We are given r balls and n cells. The balls are indistinguishable and are to be randomly 
distributed among the n cells. Assuming that each arrangement is equally likely, compute 
the probability that all cells are occupied. Note that the balls may represent data packets 
and the cells buffers. Or, the balls may represent air-dropped food rations and the cells, 
people in a country in famine. 


Solution Let $E_i$ denote the event that cell $i$ is empty ($i = 1, \ldots, n$). Then the $r$ balls
are placed among the remaining $n - 1$ cells. For each of the $r$ balls there are $n - 1$ cells to
choose from. Hence there are $A(r, n-1) \triangleq (n-1)^r$ ways of arranging the $r$ balls among
the $n - 1$ cells. Obviously, since the balls are indistinguishable, not all arrangements will be
distinguishable. Indeed there are only $\binom{n+r-1}{n-1}$ distinguishable distributions of $r$ balls
among $n$ cells, and these are not, typically, equally likely. The total number of ways of
distributing the $r$ balls among the $n$ cells is $n^r$. Hence

$$P[E_i] = (n-1)^r / n^r = \left(1 - \frac{1}{n}\right)^r \triangleq P_i.$$

Next assume that cells $i$ and $j$ are empty. Then $A(r, n-2) = (n-2)^r$ and

$$P[E_iE_j] = \left(1 - \frac{2}{n}\right)^r \triangleq P_{ij}.$$

In a similar fashion, it is easy to show that $P[E_iE_jE_k] = (1 - 3/n)^r \triangleq P_{ijk}$, and so on.
Note that the right-hand side expressions for $P_i$, $P_{ij}$, $P_{ijk}$, and so on do not contain the
subscripts $i$, $ij$, $ijk$, and so on. Thus, each $S_i$ contains $\binom{n}{i}$ identical terms and their sum
amounts to

$$S_i = \binom{n}{i}\left(1 - \frac{i}{n}\right)^r.$$

Let $E$ denote the event that at least one cell is empty. Then by Theorem 1.5-1,

$$P[E] = P\left[\bigcup_{i=1}^{n} E_i\right] = S_1 - S_2 + \cdots + (-1)^{n+1}S_n.$$

Substituting for $S_i$ from two lines above, we get

$$P[E] = \sum_{i=1}^{n}\binom{n}{i}(-1)^{i+1}\left(1 - \frac{i}{n}\right)^r.$$ (1.8-8)

The event that all cells are occupied is $E^c$. Hence $P[E^c] = 1 - P[E]$, which can be written as

$$P[E^c] = \sum_{i=0}^{n}\binom{n}{i}(-1)^{i}\left(1 - \frac{i}{n}\right)^r.$$ (1.8-9)

Example 1.8-6
(m cells empty) Use Equation 1.8-9 to compute the probability that exactly $m$ out of the
$n$ cells are empty after the $r$ balls have been distributed. We denote this probability by the
three-parameter function $P_m(r, n)$.


Solution We write $P[E^c] = P_0(r, n)$. Now assume that exactly $m$ cells are empty and
$n - m$ cells are occupied. Next, let's fix the $m$ cells that are empty, for example, cells numbers
$\underbrace{2, 4, 5, 7, \ldots, l}_{m \text{ terms}}$. Let $B(r, n-m)$ be the number of ways of distributing
$r$ balls among the remaining $n - m$ cells such that no cell remains empty and let $A(r, n-m)$
denote the number of ways of distributing $r$ balls among $n - m$ cells. Then
$P_0(r, n-m) = B(r, n-m)/A(r, n-m)$ and, since $A(r, n-m) = (n-m)^r$, we get that
$B(r, n-m) = (n-m)^r P_0(r, n-m)$. There are $\binom{n}{m}$ ways of placing $m$ empty cells among
$n$ cells. Hence the total number of arrangements of $r$ balls among $n$ cells such that $m$ remain
empty is $\binom{n}{m}(n-m)^r P_0(r, n-m)$. Finally, the number of ways of distributing $r$ balls
among $n$ cells is $n^r$. Thus,

$$P_m(r, n) = \binom{n}{m}(n-m)^r P_0(r, n-m)/n^r$$

or, after simplifying,

$$P_m(r, n) = \binom{n}{m}\sum_{i=0}^{n-m}\binom{n-m}{i}(-1)^i\left(1 - \frac{m+i}{n}\right)^r.$$ (1.8-10)

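Equations 1.8-9 and 1.8-10 are easy to check against brute-force enumeration for small $r$ and $n$. The following is a sketch under the example's assumption that all $n^r$ placements are equally likely; the names `prob_m_empty` and `prob_m_empty_brute` are illustrative, not from the text.

```python
from itertools import product
from math import comb

def prob_m_empty(m, r, n):
    """Equation 1.8-10; with m = 0 it reduces to Equation 1.8-9."""
    return comb(n, m) * sum(
        comb(n - m, i) * (-1) ** i * (1 - (m + i) / n) ** r
        for i in range(n - m + 1)
    )

def prob_m_empty_brute(m, r, n):
    """Enumerate all n**r placements and count those leaving exactly m cells empty."""
    hits = sum(
        1
        for cells in product(range(n), repeat=r)
        if n - len(set(cells)) == m
    )
    return hits / n ** r

r, n = 5, 3
for m in range(n + 1):
    print(m, prob_m_empty(m, r, n), prob_m_empty_brute(m, r, n))
```

The enumeration grows as $n^r$, so it is only practical as a sanity check for small cases; the closed form is what one would use for realistic numbers of packets and buffers.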


1.9 BERNOULLI TRIALS—BINOMIAL AND MULTINOMIAL PROBABILITY LAWS 


Consider the very simple experiment consisting of a single trial with a binary outcome: a
success $\{\zeta_1 = \text{s}\}$ with probability $p$, $0 < p < 1$, or a failure $\{\zeta_2 = \text{f}\}$ with
probability $q = 1 - p$. Thus, $P[\text{s}] = p$, $P[\text{f}] = q$ and the sample space is
$\Omega = \{\text{s}, \text{f}\}$. The $\sigma$-field of events $\mathcal{F}$ is $\phi$, $\Omega$, $\{\text{s}\}$, $\{\text{f}\}$.
Such an experiment is called a Bernoulli trial.

Suppose we do the experiment twice. The new sample space $\Omega_2$, written
$\Omega_2 = \Omega \times \Omega$, is the set of all ordered 2-tuples

$$\Omega_2 = \{\text{ss}, \text{sf}, \text{fs}, \text{ff}\}.$$

$\mathcal{F}$ contains $2^4 = 16$ events. Some are $\phi$, $\Omega_2$, $\{\text{ss}\}$, $\{\text{ss}, \text{ff}\}$, and so forth.
In the general case of $n$ Bernoulli trials, the Cartesian product sample space becomes

$$\Omega_n = \underbrace{\Omega \times \Omega \times \cdots \times \Omega}_{n \text{ times}}$$

and contains $2^n$ elementary outcomes, each of which is an ordered $n$-tuple. Thus,

$$\Omega_n = \{a_1, \ldots, a_M\},$$

where $M = 2^n$ and $a_i \triangleq z_{i_1} \cdots z_{i_n}$, an ordered $n$-tuple, where $z_{i_j} = \text{s}$ or f. Since each
outcome $z_{i_j}$ is independent of any other outcome, the joint probability
$P[z_{i_1} \cdots z_{i_n}] = P[z_{i_1}]P[z_{i_2}] \cdots P[z_{i_n}]$. Thus, the probability of a given ordered set of
$k$ successes and $n - k$ failures is simply $p^kq^{n-k}$.
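The product sample space can be enumerated directly for small $n$. The sketch below is illustrative; `outcome_probabilities` is a name introduced here, and the value $p = 0.3$ is an arbitrary choice. It builds all $2^n$ ordered strings and assigns each the probability $p^kq^{n-k}$.

```python
from itertools import product

def outcome_probabilities(n, p):
    """Map each ordered n-tuple of 's'/'f' to its probability p**k * q**(n-k)."""
    q = 1 - p
    probs = {}
    for outcome in product("sf", repeat=n):
        k = outcome.count("s")                 # number of successes in this string
        probs["".join(outcome)] = p ** k * q ** (n - k)
    return probs

probs = outcome_probabilities(3, 0.3)
print(len(probs), sum(probs.values()))
print(probs["ssf"], probs["sfs"], probs["fss"])
```

Strings with the same number of successes receive the same probability regardless of order, which is the fact used later to obtain the binomial law.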


Example 1.9-1 
(repeated trials of coin toss) Suppose we throw a coin three times with $p = P[\text{H}]$ and
$q = P[\text{T}]$. The probability of the event {HTH} is $pqp = p^2q$. The probability of the event
{THH} is also $p^2q$. The different events leading to two heads and one tail are listed here:

$E_1$ = {HHT},
$E_2$ = {HTH},
$E_3$ = {THH}.

If $F$ denotes the event of getting two heads and one tail without regard to order, then
$F = E_1 \cup E_2 \cup E_3$. Since $E_iE_j = \phi$ for all $i \ne j$, we obtain
$P[F] = P[E_1] + P[E_2] + P[E_3] = 3p^2q$.


Let us now generalize the previous result by considering an experiment consisting of $n$
Bernoulli trials. The sample space $\Omega_n$ contains $M = 2^n$ outcomes $a_1, a_2, \ldots, a_M$, where each
$a_i$ is a string of $n$ symbols, and each symbol represents a success s or a failure f. Consider
the event $A_k \triangleq \{k \text{ successes in } n \text{ trials}\}$ and let the primed outcomes, that is, $a_i'$, denote
strings with $k$ successes and $n - k$ failures. Then, with $K$ denoting the number of ordered
arrangements involving $k$ successes and $n - k$ failures, we write

$$A_k = \bigcup_{i=1}^{K}\{a_i'\}.$$

To determine how large $K$ is, we use an artifice similar to that used in proving Equation
1.8-6. Here, let bars represent failures and stars represent successes. Then, as an example,

| * * | * * | | *

represents five successes in nine tries in the order fssfssffs. How many such arrangements
are there? The solution is given by Equation 1.8-6 with $r = k$ and $(n - 1) + r$ replaced by
$(n - k) + k = n$. (Note that there is no restriction that the first and last symbols must be
bars.) Thus,

$$K = \binom{n}{k}$$

and, since the $\{a_i'\}$ are disjoint, that is, $\{a_i'\} \cap \{a_j'\} = \phi$ for all $i \ne j$, we obtain

$$P[A_k] = P\left[\bigcup_{i=1}^{K}\{a_i'\}\right] = \sum_{i=1}^{K} P[a_i'].$$

Finally, since $P[a_i'] = p^kq^{n-k}$ regardless of the ordering of the s's and f's, we obtain

$$P[A_k] = \binom{n}{k}p^kq^{n-k} \triangleq b(k; n, p).$$ (1.9-1)


Binomial probability law. The three-parameter function $b(k; n, p)$ defined in Equation
1.9-1 is called the binomial probability law and is the probability of getting $k$ successes
in $n$ independent tries with individual Bernoulli trial success probability $p$. The binomial
coefficient

$$C_k^n \triangleq \binom{n}{k}$$

was introduced in the previous section and is the number of subpopulations of size $k$ that
can be formed from a population of size $n$. In Example 1.9-1 about tossing a coin three
times, the population has size 3 (three tries) and the subpopulation has size 2 (two heads),
and we were interested in getting two heads in three tries with order being irrelevant. Thus,
the correct result is $C_2^3 = 3$. Note that had we asked for the probability of getting two
heads on the first two tosses followed by a tail, that is, $P[E_1]$, we would not have used the
coefficient $C_2^3$ since there is only one way that this event can happen.
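Equation 1.9-1 translates into a one-line function. This is a minimal sketch; the name `binomial_pmf` is an assumption made here, and `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p): probability of k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example 1.9-1 with a fair coin: two heads in three tosses, order irrelevant.
two_heads = binomial_pmf(2, 3, 0.5)

# One specific ordering such as HHT carries no binomial coefficient.
hht_only = 0.5 ** 2 * 0.5

print(two_heads, hht_only)
```

The factor of 3 between the two printed values is exactly the coefficient $C_2^3$ discussed above.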


Example 1.9-2 
(draw two balls from urn) Suppose $n = 4$; that is, there are four balls numbered 1 to 4 in the
urn. The number of distinguishable, ordered samples of size 2 that can be drawn without
replacement is 12, that is, {1,2}; {1,3}; {1,4}; {2,1}; {2,3}; {2,4}; {3,1}; {3,2}; {3,4};
{4,1}; {4,2}; {4,3}. The number of distinguishable unordered sets is 6, that is,
{1,2}; {1,3}; {1,4}; {2,3}; {2,4}; {3,4}.
From Equation 1.8-2 we obtain this result directly; that is ($n = 4$, $k = 2$)

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{4!}{2!\,2!} = 6.$$

Example 1.9-3
(binary pulses) Ten independent binary pulses per second arrive at a receiver. The error 
(i.e., a zero received as a one or vice versa) probability is 0.001. What is the probability of 
at least one error/second? 


Solution 


$$P[\text{at least one error/sec}] = 1 - P[\text{no errors/sec}]
= 1 - \binom{10}{0}(0.001)^0(0.999)^{10} = 1 - (0.999)^{10} \approx 0.01.$$

Observation. Note that

$$\sum_{k=0}^{n} b(k; n, p) = 1. \quad \text{Why?}$$


Example 1.9-4 
(odd-man out) An odd number of people want to play a game that requires two teams made 
up of even numbers of players. To decide who shall be left out to act as umpire, each of 
the N persons tosses a fair coin with the following stipulation: If there is one person whose 
outcome (be it heads or tails) is different from the rest of the group, that person will be 
the umpire. Assume that there are 11 players. What is the probability that a player will be 
“odd-man out,” that is, will be the umpire on the first play? 


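Solution Let $E \triangleq \{10\text{H}, 1\text{T}\}$, where 10H means H, H, ..., H ten times, and
$F \triangleq \{10\text{T}, 1\text{H}\}$. Then $EF = \phi$ and

$$P[E \cup F] = P[E] + P[F]
= \binom{11}{10}\left(\frac{1}{2}\right)^{10}\left(\frac{1}{2}\right)
+ \binom{11}{1}\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{10} \approx 0.01074.$$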


Example 1.9-5 
(more odd-man out) In Example 1.9-4 derive a formula for the probability that the odd-man 
out will occur for the first time on the nth play. (Hint: Consider each play as an independent 
Bernoulli trial with success if an odd-man out occurs and failure otherwise.) 


Solution Let $E$ be the event of odd-man out for the first time on the $n$th play. Let $F$ be the
event of no odd-man out in $n - 1$ plays and let $G$ be the event of an odd-man out on the
$n$th play. Then

$$E = FG.$$

Since it is completely reasonable to assume $F$ and $G$ are independent events, we can write

$$P[E] = P[F]P[G],$$

$$P[F] = \binom{n-1}{0}(0.0107)^0(0.9893)^{n-1} = (0.9893)^{n-1},$$

$$P[G] = 0.0107.$$

Thus, $P[E] = (0.0107)(0.9893)^{n-1}$, $n \ge 1$, which is often referred to as a geometric
distribution† or law.
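Examples 1.9-4 and 1.9-5 can be reproduced with a few lines. The sketch below assumes a fair coin and at least three players (so that "exactly one head" and "exactly one tail" are disjoint events); the function names are illustrative.

```python
from math import comb

def odd_man_out_prob(n_players):
    """P[exactly one head] + P[exactly one tail] for fair coins, n_players >= 3."""
    return 2 * comb(n_players, 1) * 0.5 ** n_players

def first_odd_man_on_play(n, p):
    """Geometric law: n - 1 plays without an odd man, then one with."""
    return p * (1 - p) ** (n - 1)

p = odd_man_out_prob(11)
print(p)
for n in (1, 10, 100):
    print(n, first_odd_man_on_play(n, p))
```

Because $p$ is only about 0.01 for eleven players, the geometric probabilities decay slowly, so many plays may be needed before an umpire is selected.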
Example 1.9-6 
(multiple lottery tickets) If you want to buy 50 lottery tickets, should you purchase them
all in one lottery, or should you buy single tickets in 50 similar lotteries? For simplicity we
take the case of a 100 ticket lottery with ticket prices of \$1 each, and 50 such independent
lotteries are available. Consider first the case of buying 50 tickets from one such lottery. Let
$E_i$ denote the event that the $i$th ticket is the winning ticket. Since any ticket is as likely to
be the winning ticket as any other ticket, and not more than one ticket can be a winner,
we have by classical probability that $P[E_i] = n_{\text{win}}/n_{\text{tot}} = 1/100$ for $i = 1, \ldots, 100$. The
event of winning the lottery is that one of the 50 purchased tickets is the winning ticket or,
equivalently, with $E$ denoting the event that one of the 50 tickets is the winner,
$E = \bigcup_{i=1}^{50} E_i$ and $P[E] = P[\bigcup_{i=1}^{50} E_i] = \sum_{i=1}^{50} P[E_i] = 50 \times 1/100 = 0.5$. Next we
consider the case of buying 1 ticket in each of 50 separate lotteries. We recognize this as
Bernoulli trials with an individual success probability $p = 0.01$ and $q = 0.99$. With the aid of a
calculator, we can find the probability of winning (exactly) once as

$$P[\text{win once}] = b(1; 50, 0.01) = \binom{50}{1}(0.01)^1(0.99)^{49}
= 50 \times 10^{-2} \times 0.611 \doteq 0.306.‡$$


Similarly, we find the probability of winning twice $b(2; 50, 0.01) \doteq 0.076$, the probability
of winning three times $b(3; 50, 0.01) \doteq 0.012$, the probability of winning four times
$b(4; 50, 0.01) \doteq 0.001$, and the probability of winning more times is negligible. As a check
we can easily calculate the probability of winning at least once,

$$P[\text{win at least once}] = 1 - P[\text{lose every time}] = 1 - q^{50} = 1 - (0.99)^{50} = 0.395.$$


†A popular variant on this definition is the alternative geometric distribution given as $pq^n$, $n \ge 0$,
with $q = 1 - p$ and $0 < p < 1$.

‡We use the notation $\doteq$ (an equals sign with a dot over it) to indicate that all the decimal digits
shown are correct.




Indeed we have 0.395 = 0.306 + 0.076 + 0.012 + 0.001. We thus find that, if your only concern
is to win at least once, it is better to buy all 50 tickets from one lottery. On the other hand,
when playing in separate lotteries, there is the possibility of winning multiple times. So your
average winnings may be more of a concern. Assuming a fair lottery with payoff \$100, we
can calculate an average winnings as

$$100 \times 0.306 + 200 \times 0.076 + 300 \times 0.012 + 400 \times 0.001 = 49.8.$$

So, in terms of average winnings, it is about the same either way.
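The comparison in Example 1.9-6 can be recomputed without rounding the individual terms. This is a sketch with illustrative variable names; the payoff of 100 dollars and the ticket counts are taken from the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p) from Equation 1.9-1."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p, payoff = 50, 0.01, 100

# Strategy A: 50 of the 100 tickets in a single lottery.
p_win_single = 50 / 100
expected_single = payoff * p_win_single

# Strategy B: one ticket in each of 50 independent lotteries.
p_at_least_once = 1 - (1 - p) ** n
expected_separate = sum(payoff * k * binomial_pmf(k, n, p) for k in range(n + 1))

print(p_win_single, expected_single)
print(p_at_least_once, expected_separate)
```

Summing over all $k$ rather than stopping at four wins shows that the small gap between 49.8 and 50 in the text comes only from rounding and truncation.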


Further discussion of the binomial law. We write down some formulas for further use.
The probability $B(k; n, p)$ of $k$ or fewer successes in $n$ tries is given by

$$B(k; n, p) = \sum_{i=0}^{k} b(i; n, p) = \sum_{i=0}^{k}\binom{n}{i}p^iq^{n-i}.$$ (1.9-2)

The symbol $B(k; n, p)$ is called the binomial distribution function. The probability of $k$ or
more successes in $n$ tries is

$$\sum_{i=k}^{n} b(i; n, p) = 1 - B(k-1; n, p).$$

The probability of more than $k$ successes but no more than $j$ successes is

$$\sum_{i=k+1}^{j} b(i; n, p) = B(j; n, p) - B(k; n, p).$$

There will be much more on distribution functions in later chapters. We illustrate the
application of this formula in Example 1.9-7.


Example 1.9-7 
(missile attack) Five missiles are fired against an aircraft carrier in the ocean. It takes at 
least two direct hits to sink the carrier. All five missiles are on the correct trajectory but 
must get through the “point-defense” guns of the carrier. It is known that the point-defense 
guns can destroy a missile with probability p = 0.9. What is the probability that the carrier 
will still be afloat after the encounter? 


Solution Let $E$ be the event that the carrier is still afloat and let $F$ be the event of a
missile getting through the point-defense guns. Then

$$P[F] = 0.1$$

and

$$P[E] = 1 - P[E^c] = 1 - \sum_{i=2}^{5}\binom{5}{i}(0.1)^i(0.9)^{5-i} \approx 0.92.$$
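The distribution function of Equation 1.9-2 gives the missile result directly, since the carrier stays afloat when at most one missile gets through. The sketch below uses illustrative names `binomial_pmf` and `binomial_cdf`.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p) from Equation 1.9-1."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """B(k; n, p): probability of k or fewer successes in n tries."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Example 1.9-7: five missiles, each gets through with probability 0.1.
p_afloat = binomial_cdf(1, 5, 0.1)

# Same number via the complement: fewer than two hits = 1 - P[two or more hits].
p_afloat_alt = 1 - sum(binomial_pmf(i, 5, 0.1) for i in range(2, 6))

print(p_afloat, p_afloat_alt)
```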


Multinomial Probability Law 


The multinomial probability law is a generalization of the binomial law. The binomial law
is based on Bernoulli trials in which only two outcomes are possible. The multinomial law
is based on a generalized Bernoulli trial in which $l$ outcomes are possible. Thus, consider an
elementary experiment consisting of a single trial with $l$ elementary outcomes
$\zeta_1, \zeta_2, \ldots, \zeta_l$. Let the probability of outcome $\zeta_i$ be $p_i$ ($i = 1, \ldots, l$). Then

$$p_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{l} p_i = 1.$$ (1.9-3)

Assume that this generalized Bernoulli trial is repeated $n$ times and consider the event
consisting of a prescribed, ordered string of elementary outcomes in which $\zeta_1$ appears $r_1$
times, $\zeta_2$ appears $r_2$ times, and so on until $\zeta_l$ appears $r_l$ times. What is the probability of
this event? The key here is that the order is prescribed a priori. For example, with $l = 3$
(three possible outcomes) and $n = 6$ (six tries), a prescribed string might be
$\zeta_1\zeta_3\zeta_2\zeta_2\zeta_1\zeta_2$ so that $r_1 = 2$, $r_2 = 3$, $r_3 = 1$. Observe that $\sum_{i=1}^{l} r_i = n$. Since the
outcome of each trial is an independent event, the probability of observing a prescribed ordered
string is $p_1^{r_1}p_2^{r_2} \cdots p_l^{r_l}$. Thus, for the string $\zeta_1\zeta_3\zeta_2\zeta_2\zeta_1\zeta_2$ the probability is
$p_1^2p_2^3p_3$.

A different (greater) probability results when order is not specified. Suppose we perform
$n$ repetitions of a generalized Bernoulli trial and consider the event in which $\zeta_1$ appears
$r_1$ times, $\zeta_2$ appears $r_2$ times, and so forth, without regard to order. Before computing the
probability of this event we furnish an example.


Example 1.9-8 
(busy emergency number) In calling the Sav-Yur-Life health care facility to report an emer- 
gency, one of three things can happen: 


(1) the line is busy (event $E_1$);
(2) you get the wrong number (event $E_2$); and
(3) you get through to the triage nurse (event $E_3$).


Assume P[E;] = p;. What is the probability that in five separate emergencies at different 
times, initial calls are met with four busy signals and one wrong number? 


Solution Let $F$ denote the event of getting four busy signals and one wrong number.
Then $F = F_1 \cup F_2 \cup F_3 \cup F_4 \cup F_5$, where

$F_1 = \{E_1E_1E_1E_1E_2\}$, $F_2 = \{E_1E_1E_1E_2E_1\}$, $F_3 = \{E_1E_1E_2E_1E_1\}$,
$F_4 = \{E_1E_2E_1E_1E_1\}$, and $F_5 = \{E_2E_1E_1E_1E_1\}$.

Since $F_iF_j = \phi$ for $i \ne j$, $P[F] = \sum_{i=1}^{5} P[F_i]$. But $P[F_i] = p_1^4p_2^1p_3^0$ independent of $i$.
Hence $P[F] = 5p_1^4p_2^1p_3^0$.
With the assumed $p_1 = 0.3$, $p_2 = 0.1$, $p_3 = 0.6$, we get

$$P[F] = 5 \times 8.1 \times 10^{-3} \times 0.1 \times 1 \approx 0.004.$$


In problems of this type we must count all the strings of length $n$ in which $\zeta_1$ appears
$r_1$ times, $\zeta_2$ appears $r_2$ times, and so on. In the example just considered, there were five
such strings. In the general case of $n$ trials with $r_1$ outcomes of $\zeta_1$, $r_2$ outcomes of $\zeta_2$, and
so on, there are

$$\frac{n!}{r_1!\,r_2!\cdots r_l!}$$ (1.9-4)

such strings. In Example 1.9-8, $n = 5$, $r_1 = 4$, $r_2 = 1$, $r_3 = 0$ so that

$$\frac{5!}{4!\,1!\,0!} = 5.$$

The number in Equation 1.9-4 is recognized as the multinomial coefficient. To check that it
is the appropriate coefficient, consider the $r_1$ outcomes $\zeta_1$. The number of ways of placing
the $r_1$ outcomes $\zeta_1$ among the $n$ trials is identical with the number of subpopulations of
size $r_1$ in a population of size $n$, which is $\binom{n}{r_1}$. That leaves $n - r_1$ trials among which we
wish to place $r_2$ outcomes $\zeta_2$. The number of ways of doing that is $\binom{n-r_1}{r_2}$. Repeating
this process we obtain the total number of distinguishable arrangements

$$\binom{n}{r_1}\binom{n-r_1}{r_2}\cdots\binom{n-r_1-r_2-\cdots-r_{l-1}}{r_l} = \frac{n!}{r_1!\,r_2!\cdots r_l!}.$$

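The counting argument above (choose positions for $\zeta_1$, then for $\zeta_2$ among what is left, and so on) can be coded literally and compared with the factorial form of Equation 1.9-4. This is a sketch; both function names are assumptions introduced here.

```python
from math import comb, factorial, prod

def multinomial_coefficient(counts):
    """Product of binomial coefficients, following the placement argument."""
    remaining = sum(counts)          # trials not yet assigned an outcome
    total = 1
    for r in counts:
        total *= comb(remaining, r)  # ways to place r outcomes among remaining trials
        remaining -= r
    return total

def multinomial_coefficient_factorial(counts):
    """n! / (r_1! r_2! ... r_l!) evaluated directly."""
    return factorial(sum(counts)) // prod(factorial(r) for r in counts)

for counts in [(4, 1, 0), (2, 1, 1), (6, 3, 1)]:
    print(counts, multinomial_coefficient(counts),
          multinomial_coefficient_factorial(counts))
```

The product form never builds a number larger than the final answer times a modest factor, which can matter when $n$ is large.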


Example 1.9-9 
(repeated generalized Bernoulli) Consider four repetitions of a generalized Bernoulli
experiment in which the outcomes are ∗, •, ○. What is the number of ways of getting two ∗,
one •, and one ○?

Solution The number of ways of getting two ∗ in four trials is $\binom{4}{2} = 6$. If we let the
spaces between bars represent a trial, then we can denote the outcomes as

|∗|∗| | |   |∗| |∗| |   |∗| | |∗|   | |∗|∗| |   | |∗| |∗|   | | |∗|∗|

The number of ways of placing • among the two remaining cells is $\binom{2}{1} = 2$. The number
of ways of placing ○ among the remaining cell is $\binom{1}{1} = 1$. Hence the total number of
arrangements is $6 \cdot 2 \cdot 1 = 12$. They are

∗∗•○   ∗∗○•   ∗•∗○   ∗○∗•
∗•○∗   ∗○•∗   •∗∗○   ○∗∗•
•∗○∗   ○∗•∗   •○∗∗   ○•∗∗



We can now state the multinomial probability law. Consider a generalized Bernoulli trial
with outcomes $\zeta_1, \zeta_2, \ldots, \zeta_l$ and let the probability of observing outcome $\zeta_i$ be $p_i$,
$i = 1, \ldots, l$, where $p_i \ge 0$ and $\sum_{i=1}^{l} p_i = 1$. The probability that in $n$ trials $\zeta_1$ occurs
$r_1$ times, $\zeta_2$ occurs $r_2$ times, and so on is

$$P(\mathbf{r}; n, \mathbf{p}) = \frac{n!}{r_1!\,r_2!\cdots r_l!}\,p_1^{r_1}p_2^{r_2}\cdots p_l^{r_l},$$ (1.9-5)

where $\mathbf{r}$ and $\mathbf{p}$ are $l$-tuples defined by

$$\mathbf{r} = (r_1, r_2, \ldots, r_l), \quad \mathbf{p} = (p_1, p_2, \ldots, p_l), \quad \text{and} \quad \sum_{i=1}^{l} r_i = n.$$

Observation. With $l = 2$, Equation 1.9-5 becomes the binomial law with $p_1 \triangleq p$,
$p_2 \triangleq 1 - p$, $r_1 \triangleq k$, and $r_2 \triangleq n - k$. Functions such as Equations 1.9-1 and 1.9-5 are
often called probability mass functions.
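Equation 1.9-5 can be sketched as a short function and checked against two facts from the text: the value found in Example 1.9-8, and the reduction to the binomial law when $l = 2$. The name `multinomial_pmf` is illustrative.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """P(r; n, p) of Equation 1.9-5 for tuples counts = r and probs = p."""
    coeff = factorial(sum(counts))
    for r in counts:
        coeff //= factorial(r)       # exact integer division at every step
    value = float(coeff)
    for r, p in zip(counts, probs):
        value *= p ** r
    return value

# Example 1.9-8: four busy signals, one wrong number, no successful call.
p_busy_case = multinomial_pmf((4, 1, 0), (0.3, 0.1, 0.6))

# With l = 2 the law reduces to b(k; n, p); here k = 2, n = 3, p = 0.5.
p_two_outcomes = multinomial_pmf((2, 1), (0.5, 0.5))
p_binomial = comb(3, 2) * 0.5 ** 2 * 0.5 ** 1

print(p_busy_case, p_two_outcomes, p_binomial)
```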


Example 1.9-10 
(emergency calls) In the United States, 911 is the all-purpose number used to summon an
ambulance, the police, or the fire department. In the rowdy city of Nirvana in upstate New
York, it has been found that 60 percent of all calls request the police, 25 percent request
an ambulance, and 15 percent request the fire department. We observe the next ten calls.
What is the probability of the combined event that six calls will ask for the police, three
for ambulances, and one for the fire department?

Solution Using Equation 1.9-5 we get

$$P(6, 3, 1; 10, 0.6, 0.25, 0.15) = \frac{10!}{6!\,3!\,1!}(0.6)^6(0.25)^3(0.15)^1 \approx 0.092.$$


A numerical problem appears if $n$ gets large. For example, suppose we observe 100 calls and
consider the event of 60 calls for the police, 30 for ambulances, and 10 for the fire department;
clearly computing numbers such as 100!, 60!, 30! requires some care. An important result
that helps in evaluating such large factorials is Stirling's† formula:

$$n! \approx (2\pi)^{1/2}\,n^{n+(1/2)}\,e^{-n},$$

where the approximation improves as $n$ increases, for example,

n      n!           Stirling's formula    Percent error
1      1            0.922137              8
10     3,628,800    3,598,700             0.8

When using a numerical computer to evaluate Equation 1.9-5, additional care must be used
to avoid loss of accuracy due to under- and over-flow. A joint evaluation of pairs of large
and small numbers can help in this regard, as can the use of logarithms.
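One way to act on the remark about logarithms is to accumulate the logarithm of Equation 1.9-5 and exponentiate only at the end. The sketch below is an illustration rather than a prescribed method: it uses `math.lgamma`, for which `lgamma(n + 1)` equals $\ln n!$, and the function names are assumptions. It also reproduces the Stirling entries in the table above.

```python
import math

def log_multinomial_pmf(counts, probs):
    """Natural log of Equation 1.9-5; assumes p > 0 wherever the count is positive."""
    log_p = math.lgamma(sum(counts) + 1)       # ln n!
    for r, p in zip(counts, probs):
        log_p -= math.lgamma(r + 1)            # subtract ln r_i!
        if r > 0:
            log_p += r * math.log(p)           # add r_i ln p_i
    return log_p

def stirling(n):
    """Stirling's approximation (2 pi)^(1/2) n^(n + 1/2) e^(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

probs = (0.6, 0.25, 0.15)
p_small = math.exp(log_multinomial_pmf((6, 3, 1), probs))      # ten calls
p_large = math.exp(log_multinomial_pmf((60, 30, 10), probs))   # one hundred calls
print(p_small, p_large)

for n in (1, 10):
    exact = math.factorial(n)
    print(n, exact, stirling(n), 100 * (exact - stirling(n)) / exact)
```

Working in the log domain keeps every intermediate quantity of moderate size, so the 100-call case needs no special handling.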

As stated earlier, the binomial law is a special case, perhaps the most important case, 
of the multinomial law. When the parameters of the binomial law attain extreme values, the 
binomial law can be used to generate another important probability law. This is explored 
next. 


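1.10 ASYMPTOTIC BEHAVIOR OF THE BINOMIAL LAW: THE POISSON LAW


Suppose that in the binomial function $b(k; n, p)$, $n \gg 1$, $p \ll 1$, but $np$ remains constant,
say $np = \mu$. Recall that $q = 1 - p$. Hence

$$\binom{n}{k}p^k(1-p)^{n-k} \approx \frac{1}{k!}\,\mu^k\left(1 - \frac{\mu}{n}\right)^{n-k},$$

where $n(n-1)\cdots(n-k+1) \approx n^k$ if $n$ is allowed to become large enough and $k$ is held
fixed. Hence in the limit as $n \to \infty$, $p \to 0$, and $k \ll n$, we obtain

$$b(k; n, p) \approx \frac{1}{k!}\,\mu^k\left(1 - \frac{\mu}{n}\right)^{n-k} \to \frac{\mu^k}{k!}\,e^{-\mu}.$$ (1.10-1)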


†James Stirling, eighteenth-century mathematician.

Thus, in situations where the binomial law applies with $n \gg 1$, $p \ll 1$ but $np = \mu$ is a
finite constant, we can use the approximation

$$b(k; n, p) \approx \frac{\mu^k}{k!}\,e^{-\mu}.$$ (1.10-2)

Poisson law. The Poisson probability law, with parameter $\mu$ ($> 0$), is given as

$$p(k) = \frac{\mu^k}{k!}\,e^{-\mu}, \quad 0 \le k < \infty.$$

Unlike the binomial law, the Poisson law just has one parameter $\mu$ that can take on any
positive value.
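The quality of the approximation in Equation 1.10-2 is easy to inspect numerically. The sketch below compares the two laws term by term; the values $n = 1000$ and $p = 0.002$ are arbitrary illustrative choices satisfying $n \gg 1$ and $p \ll 1$, and the function names are assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """b(k; n, p) from Equation 1.9-1."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """Poisson law p(k) = mu**k * exp(-mu) / k!."""
    return mu ** k * exp(-mu) / factorial(k)

n, p = 1000, 0.002            # illustrative values; mu = n * p = 2
mu = n * p

for k in range(6):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, mu))

# Largest absolute discrepancy over the first eleven terms.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(11))
print(max_gap)
```

Repeating the run with a smaller $n$ and the same $\mu$ (for instance $n = 20$, $p = 0.1$) shows the gap widening, which illustrates why the conditions $n \gg 1$ and $p \ll 1$ are needed.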


Example 1.10-1 
(time to failure) A computer contains 10,000 components. Each component fails independently
from the others and the yearly failure probability per component is $10^{-4}$. What is
the probability that the computer will be working one year after turn-on? Assume that the
computer fails if one or more components fail.

Solution

$$p = 10^{-4}, \quad n = 10{,}000, \quad k = 0, \quad np = 1.$$

Hence

$$b(0; 10{,}000, 10^{-4}) \approx \frac{1^0}{0!}\,e^{-1} = \frac{1}{e} \doteq 0.368.$$


Example 1.10-2 
(random points in time) Suppose that $n$ independent points are placed at random in an
interval $(0, T)$. Let $0 < t_1 < t_2 < T$ and $t_2 - t_1 \triangleq \tau$. Let $\tau/T \ll 1$ and $n \gg 1$. What is
the probability of observing exactly $k$ points in $\tau$ seconds? (Figure 1.10-1.)


Solution Consider a single point placed at random in $(0, T)$. The probability of the point
appearing in $\tau$ is $\tau/T$. Let $p = \tau/T$. Every other point has the same probability of being in
$\tau$ seconds. Hence, the probability of finding $k$ points in $\tau$ seconds is the binomial law

$$P[k \text{ points in } \tau \text{ sec}] = \binom{n}{k}p^kq^{n-k}.$$ (1.10-3)

With $n \gg 1$, we use the approximation in Equation 1.10-1 to give

$$b(k; n, p) \approx \left(\frac{n\tau}{T}\right)^k\frac{e^{-(n\tau/T)}}{k!},$$ (1.10-4)

where $n/T$ can be interpreted as the "average" number of points per unit interval.


Replacing the average rate in this example with parameter $\mu$ ($\mu > 0$), we get the Poisson
law defined by

$$P[k \text{ points}] = e^{-\mu}\,\frac{\mu^k}{k!},$$ (1.10-5)

where $k = 0, 1, 2, \ldots$. With $\mu \triangleq \lambda\tau$, where $\lambda$ is the average number of points† per unit
time and $\tau$ is the length of the interval $(t, t + \tau]$, the probability of $k$ points in elapsed time
$\tau$ is

$$P(k; t, t+\tau) = e^{-\lambda\tau}\,\frac{(\lambda\tau)^k}{k!}.$$ (1.10-6)

Figure 1.10-1 Points placed at random on a line, with marks at 0, $t_1$, $t_2$, and $T$. Each point is
placed with equal likelihood anywhere along the line.

For the Poisson law, we also stipulate that numbers of points arriving in disjoint time inter- 
vals constitute independent events. We can regard this as inherited from an underlying set 
of Bernoulli trials, which are always independent. 

In Equation 1.10-6 we assume that $\lambda$ is a constant and not a function of $t$. If $\lambda$ varies
with $t$, we can generalize $\lambda\tau$ with the integral $\int_t^{t+\tau}\lambda(u)\,du$, and the probability of
$k$ points in the interval $(t, t + \tau]$ becomes

$$P(k; t, t+\tau) = \exp\left[-\int_t^{t+\tau}\lambda(u)\,du\right]\frac{1}{k!}\left[\int_t^{t+\tau}\lambda(u)\,du\right]^k.$$ (1.10-7)

The Poisson law $P[k \text{ events in } \Delta x]$, or more generally $P[k \text{ events in } (x, x + \Delta x)]$, where $x$
is time, volume, distance, and so forth and $\Delta x$ is the interval associated with $x$, is widely
used in engineering and sciences. Some typical situations in various fields where the Poisson
law is applied are listed below.


Physics. In radioactive decay: P[$k$ $\alpha$-particles in $\tau$ seconds] with $\lambda$ the average
number of emitted $\alpha$-particles per second.

Engineering. In planning the size of a call center: P[$k$ telephone calls in $\tau$
seconds] with $\lambda$ the average number of calls per second.

Biology. In water pollution monitoring: P[$k$ coliform bacteria in 1000 cubic centimeters]
with $\lambda$ the average number of coliform bacteria per cubic centimeter.

Transportation. In planning the size of a highway toll facility: P[$k$ automobiles
arriving in $\tau$ minutes] with $\lambda$ the average number of automobiles per minute.

Optics. In designing an optical receiver: P[$k$ photons per second over a surface
of area $A$] with $\lambda$ the average number of photons-per-second per unit area.

Communications. In designing a fiber optical transmitter-receiver link: P[$k$ photoelectrons
generated at the receiver in one second] with $\lambda$ the average number of photoelectrons
per second.


†The term points here is a generic term. Equally appropriate would be "arrivals," "hits," "occurrences,"
etc.

The parameter $\lambda$ is often called, in this context, the Poisson rate parameter. Its dimensions
are points per unit interval, the interval being time, distance, volume, and so forth.
When the form of the Poisson law that we wish to use is as in Equation 1.10-6 or 1.10-7,
we speak of the Poisson law with rate parameter $\lambda$ or rate function $\lambda(t)$.
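Equations 1.10-6 and 1.10-7 differ only in how the mean count over the interval is obtained. The following sketch makes that explicit; the trapezoidal rule used for the integral and all function names are assumptions chosen for illustration, and any other quadrature would serve.

```python
import math

def poisson_prob(k, mean):
    """Poisson probability of k points when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def integrated_rate(rate, t, tau, steps=1000):
    """Trapezoidal estimate of the integral of rate(u) over (t, t + tau]."""
    h = tau / steps
    total = 0.5 * (rate(t) + rate(t + tau))
    total += sum(rate(t + i * h) for i in range(1, steps))
    return total * h

def prob_k_points(k, rate, t, tau):
    """Equation 1.10-7 with the integral evaluated numerically."""
    return poisson_prob(k, integrated_rate(rate, t, tau))

# Constant rate of 2 points per unit time: should agree with Equation 1.10-6.
print(prob_k_points(4, lambda u: 2.0, 0.0, 3.0), poisson_prob(4, 2.0 * 3.0))

# Linearly increasing rate lambda(u) = u on (0, 2]: the mean count is 2.
print(prob_k_points(0, lambda u: u, 0.0, 2.0), math.exp(-2.0))
```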


Example 1.10-3 
(misuse of probability) (a) “Prove” that there must be life in the universe, other than that
on our own Earth, by using the following numbers†: average number of stars per galaxy,
$300 \times 10^9$; number of galaxies, $100 \times 10^9$; probability that a star has a planetary system, 0.5;
average number of planets per planetary system, 9; probability that a planet can sustain
life, 1/9; probability, $p$, of life emerging on a life-sustaining planet, $10^{-12}$.


Solution First we compute $n_{ls}$, the number of planets that are life-sustaining:

$$n_{ls} = 300 \times 10^9 \times 100 \times 10^9 \times 0.5 \times 9 \times 1/9 = 1.5 \times 10^{22}.$$

Next we use the Poisson approximation to the binomial with
$a = n_{ls}\,p = 1.5 \times 10^{22} \times 10^{-12}$, for computing the probability of no life outside of
Earth's and obtain

$$b(0, 1.5 \times 10^{22}, 10^{-12}) = \frac{(1.5 \times 10^{10})^0}{0!}\,e^{-1.5 \times 10^{10}} \approx 0.$$

Hence we have just “shown” that life outside Earth has a probability of
unity, that is, a sure bet. Note that the number for life emerging on other planets, $10^{-12}$,
is impressively low.

(b) Now show that life outside Earth is extremely unlikely by using the same set of
numbers except that the probability of life emerging on a life-sustaining planet has been
reduced to $10^{-30}$.

Solution Using the Poisson approximation to the binomial, with
$a = 1.5 \times 10^{22} \times 10^{-30} = 1.5 \times 10^{-8}$, we obtain for the probability of no life outside Earth's:

$$b(0, 1.5 \times 10^{22}, 10^{-30}) = \frac{(1.5 \times 10^{-8})^0}{0!}\,e^{-1.5 \times 10^{-8}}
\approx 1 - (1.5 \times 10^{-8}) \approx 1,$$

where we have used the approximation $e^{-x} \approx 1 - x$ for small $x$.

Thus, by changing only one number, we have gone from “proving” that the universe 
contains extraterrestrial life to proving that, outside of ourselves, the universe is lifeless. 
The reason that this is a misuse of probability is that, at present, we have no idea as to the 
factors that lead to the emergence of life from nonliving material. While the calculation is 
technically correct, this example illustrates the use of contrived numbers to either prove or 
disprove what is essentially a belief or faith. 
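
The sensitivity of the conclusion to the single assumed number p can be checked numerically. The short Python sketch below evaluates the Poisson approximation $e^{-a}$, with $a = n_{ls}\,p$, for the two values of p used in parts (a) and (b). The function name `prob_no_life` is ours and purely illustrative.

```python
import math

# Number of life-sustaining planets from the figures quoted in the example.
n_ls = 300e9 * 100e9 * 0.5 * 9 * (1 / 9)   # about 1.5e22

def prob_no_life(p_emerge):
    """Poisson approximation b(0; n_ls, p) ~ exp(-a), with a = n_ls * p."""
    a = n_ls * p_emerge
    return math.exp(-a)

p_none_a = prob_no_life(1e-12)   # part (a): a = 1.5e10
p_none_b = prob_no_life(1e-30)   # part (b): a = 1.5e-8
print(p_none_a, p_none_b)
```

In double-precision arithmetic the first value underflows to zero, while the second differs from one by roughly $1.5 \times 10^{-8}$: the entire "proof" turns on one unsupported input.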


†All the numbers have been quoted at various times by proponents of the idea of extraterrestrial life.

Example 1.10-4
(website server) A website server receives on the average 16 access requests per minute. If 
the server can handle at most 24 accesses per minute, what is the probability that in any 
one minute the website will saturate? 


Solution Saturation occurs if the number of requests in a minute exceeds 24. The probability
of this event is

$$P[\text{saturation}] = \sum_{k=25}^{\infty} [\lambda\tau]^{k}\, \frac{e^{-\lambda\tau}}{k!} \tag{1.10-8}$$

$$= \sum_{k=25}^{\infty} [16]^{k}\, \frac{e^{-16}}{k!} \approx 0.017 \approx 1/60. \tag{1.10-9}$$


Thus, about once in every 60 minutes (on the “average”) will a visitor be turned away. 
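
A sum such as the one in Equation 1.10-9 is tedious by hand but easy to evaluate by machine. The following Python sketch computes the Poisson tail as one minus the finite sum over k = 0, ..., 24, so that the rounded figure quoted in the text can be compared with a direct evaluation. The helper names `poisson_pmf` and `poisson_tail` are illustrative choices of ours.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k points when the mean count is lambda*tau.

    Evaluated in log space to avoid overflow for large k.
    """
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_tail(k_min, mean):
    """P[number of points >= k_min] = 1 - sum of the pmf over k < k_min."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(k_min))

# 16 requests per minute on average; saturation means 25 or more requests.
p_saturation = poisson_tail(25, 16.0)
print(f"P[saturation] = {p_saturation:.4f}")
```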


Given the numerous applications of the Poisson law in engineering and the sciences, 
one would think that its origin is of somewhat more noble birth than “merely” as a limiting 
form of the binomial law. Indeed this is the case, and the Poisson law can be derived once 
three assumptions are made. Obviously these three assumptions should reasonably mirror 
the characteristics of the underlying physical process; otherwise our results will be of only 
marginal interest. Fortunately, in many situations these assumptions seem to be quite valid. 

In order to be concrete, we shall talk about occurrences taking place in time (as opposed 
to, say, length or distance). The Poisson law is based on the following three assumptions: 


1. The probability, $P(1; t, t+\Delta t)$, of a single event occurring in $(t, t+\Delta t]$ is proportional
to $\Delta t$, that is,

$$P(1; t, t+\Delta t) \approx \lambda(t)\,\Delta t, \quad \Delta t \to 0. \tag{1.10-10}$$

In Equation 1.10-10, $\lambda(t)$ is the Poisson rate parameter.
2. The probability of $k$ ($k > 1$) events in $(t, t+\Delta t]$ goes to zero:

$$P(k; t, t+\Delta t) \approx 0, \quad \Delta t \to 0, \quad k = 2, 3, \ldots. \tag{1.10-11}$$

3. Events in nonoverlapping time intervals are statistically independent.†
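
The three assumptions can be illustrated numerically, although this is not the derivation itself. In the Python sketch below, an interval of length $\tau$ is cut into n short slots; each slot independently holds one event with probability $\lambda\tau/n$ and never more than one, so the total count is binomial. As n grows, the binomial probabilities approach the Poisson law with mean $\lambda\tau$. The constant rate and the values $\lambda = 2$, $\tau = 3$ are assumptions chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """b(k; n, p): probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """Poisson probability of k points when the mean count is lambda*tau."""
    return mean**k * math.exp(-mean) / math.factorial(k)

def max_gap(n, lam=2.0, tau=3.0, k_max=10):
    """Largest |binomial - Poisson| over k = 0..k_max with n slots of width tau/n."""
    p_slot = lam * tau / n   # assumption 1: P[one event in a slot] ~ lambda * dt
    return max(
        abs(binomial_pmf(k, n, p_slot) - poisson_pmf(k, lam * tau))
        for k in range(k_max + 1)
    )

for n in (10, 100, 10_000):
    print(n, max_gap(n))
```

The printed gap shrinks as the slots become narrower, which is the behavior the assumptions lead one to expect.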


Starting with these three simple physical assumptions, it is a straightforward task to
obtain the Poisson probability law. We leave this derivation to Chapter 9 but merely point
out that the clever use of the assumptions leads to a set of elementary, first-order differential
equations whose solution is the Poisson law. The general solution is furnished by
Equation 1.10-7 but, fortunately, in a large number of physical situations the Poisson rate
parameter $\lambda(t)$ can be approximated by a constant, say, $\lambda$. In that case Equation 1.10-6 can
be applied. We conclude this section with a final example.


†Note in property 3 we are talking about disjoint time intervals, not disjoint events. For disjoint events
we would add probabilities, but for disjoint time intervals which lead to independent events in the Poisson
law, we multiply the individual probabilities.


Example 1.10-5 
(defects in digital tape) A manufacturer of computer tape finds that the defect density along
the length of tape is not uniform. After a careful compilation of data, it is found that for
tape strips of length $D$, the defect density $\lambda(x)$ along the tape length $x$ varies as

$$\lambda(x) = \lambda_0 + \tfrac{1}{2}(\lambda_1 - \lambda_0)\left(1 + \cos\frac{2\pi x}{D}\right), \quad \lambda_1 > \lambda_0,$$

for $0 \le x \le D$ due to greater tape contamination at the edges $x = 0$ and $x = D$.


(a) What is the meaning of $\lambda(x)$ in this case?
(b) What is the average number of defects for a tape strip of length $D$?
(c) What is an expression for the probability of $k$ defects on a tape strip of length $D$?
(d) What are the Poisson assumptions in this case?
Solution 


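(a) Bearing in mind that $\lambda(x)$ is a defect density, that is, the average number of defects
per unit length at $x$, we conclude that $\lambda(x)\Delta x$ is the average number of defects in
the tape from $x$ to $x + \Delta x$.

(b) Given the definition of $\lambda(x)$, we conclude that the average number of defects along
the whole tape is merely the integral of $\lambda(x)$, that is,

$$\int_0^D \lambda(x)\,dx = \int_0^D \left[\lambda_0 + \tfrac{1}{2}(\lambda_1 - \lambda_0)\left(1 + \cos\frac{2\pi x}{D}\right)\right] dx = \frac{\lambda_0 + \lambda_1}{2}\,D \triangleq \Lambda.$$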


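(c) Assuming the Poisson law holds, we use Equation 1.10-7 with $x$ and $\Delta x$ (distances)
replacing $t$ and $\tau$ (times). Thus,

$$P(k; x, x+\Delta x) = \exp\left[-\int_x^{x+\Delta x} \lambda(\xi)\,d\xi\right] \cdot \frac{1}{k!}\left[\int_x^{x+\Delta x} \lambda(\xi)\,d\xi\right]^{k}.$$

In particular, with $x = 0$ and $x + \Delta x = D$, we obtain

$$P(k; 0, D) = \frac{\Lambda^{k} e^{-\Lambda}}{k!},$$

where $\Lambda$ is as defined above.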
(d) The Poisson assumptions become 


(i) $P[1; x, x+\Delta x] \approx \lambda(x)\Delta x$, as $\Delta x \to 0$.
(ii) $P[k; x, x+\Delta x] = 0$ as $\Delta x \to 0$, for $k = 2, 3, \ldots$; that is, the probability
of there being more than one defect in the interval $(x, x+\Delta x)$ becomes zero as $\Delta x$
becomes vanishingly small.

(iii) the occurrences of defects (events) in nonoverlapping sections of the tape 
are independent. 
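
Parts (b) and (c) can be checked numerically. The Python sketch below assumes the raised-cosine density $\lambda(x) = \lambda_0 + \tfrac{1}{2}(\lambda_1 - \lambda_0)(1 + \cos(2\pi x/D))$, integrates it over the strip by the midpoint rule, and then evaluates the Poisson probabilities of k defects. The numerical values of $\lambda_0$, $\lambda_1$, and $D$ are hypothetical and chosen only so that the closed form $(\lambda_0 + \lambda_1)D/2$ is easy to verify.

```python
import math

def defect_density(x, lam0, lam1, D):
    """Assumed density: largest (lam1) at the edges, smallest (lam0) at mid-strip."""
    return lam0 + 0.5 * (lam1 - lam0) * (1 + math.cos(2 * math.pi * x / D))

def mean_defects(lam0, lam1, D, steps=10_000):
    """Midpoint-rule approximation to the integral of the density over [0, D]."""
    dx = D / steps
    return sum(defect_density((i + 0.5) * dx, lam0, lam1, D) for i in range(steps)) * dx

def prob_k_defects(k, Lam):
    """Poisson probability of k defects when the mean number is Lam."""
    return Lam**k * math.exp(-Lam) / math.factorial(k)

lam0, lam1, D = 0.1, 0.5, 10.0          # hypothetical values
Lam = mean_defects(lam0, lam1, D)       # closed form: (lam0 + lam1) * D / 2
print(Lam, [round(prob_k_defects(k, Lam), 4) for k in range(5)])
```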


1.11 NORMAL APPROXIMATION TO THE BINOMIAL LAW 


In this section we give, without proof, a numerical approximation to binomial probabilities 
and binomial sums. Let $S_k$ denote the event consisting of (exactly) $k$ successes in $n$ Bernoulli
trials. Then the probability of $S_k$ follows a binomial distribution and

$$P[S_k] = \binom{n}{k} p^{k} q^{n-k} = b(k; n, p), \quad 0 \le k \le n. \tag{1.11-1}$$

For large values of $n$ and $k$, Equation 1.11-1 may be difficult to evaluate numerically. Also,
the probability of the event $\{k_1 \le \text{number of successes} \le k_2\}$ may involve many terms, making
a direct evaluation of its probability $P[k_1 \le \text{number of successes} \le k_2]$ difficult. Fortunately,
when $n$ is large, we can use approximate methods for evaluating such probabilities. These
approximate methods involve the so-called Normal or Gaussian distribution.

The Normal distribution and its significance will be discussed in greater detail in 
Chapter 2 and subsequent chapters in this book. Here we use it only to help evaluate 
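binomial probabilities. For the present, define the function $f_{SN}(x)$, known as the standard
Normal density, by

$$f_{SN}(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}x^{2}\right), \tag{1.11-2}$$

and its running integral, known as the standard Normal cumulative distribution function,
by

$$F_{SN}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\tfrac{1}{2}y^{2}\right) dy. \tag{1.11-3}$$

Then, with $q = 1 - p$, the binomial probability is approximated by

$$b(k; n, p) \approx \frac{1}{\sqrt{npq}}\, f_{SN}\!\left(\frac{k - np}{\sqrt{npq}}\right). \tag{1.11-4}$$

The approximation becomes better when $npq \gg 1$. We reproduce the results from
[1-8] in Table 1.11-1. Even in this case, $npq = 1.6$, the approximation is quite good. The
approximation for sums, when $n \gg 1$ and $k_1$ and $k_2$ are fixed integers, takes the form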

$$P[k_1 \le \text{number of successes} \le k_2] \approx F_{SN}\!\left(\frac{k_2 - np + 0.5}{\sqrt{npq}}\right) - F_{SN}\!\left(\frac{k_1 - np - 0.5}{\sqrt{npq}}\right). \tag{1.11-5}$$


Table 1.11-1 Normal Approximation to the
Binomial for Selected Numbers


k | b(k;10,0.2) | Normal approximation
0 | 0.1074      | 0.0904
1 | 0.2864      | 0.2307
2 | 0.3020      | 0.3154
3 | 0.2013      | 0.2307
4 | 0.0880      | 0.0904
5 | 0.0264      | 0.0189
6 | 0.0055      | 0.0021

Table 1.11-2 Event Probabilities Using the Normal
Approximation (Adapted from [1-8])


n   | p   | α  | β   | P[α ≤ S_n ≤ β] | Normal approximation
200 | 0.5 | 95 | 105 | 0.5632         | 0.5633
500 | 0.1 | 50 | 55  | 0.3176         | 0.3235
100 | 0.3 | 12 | 14  | 0.00015        | 0.00033
100 | 0.3 | 27 | 29  | 0.2379         | 0.2341
100 | 0.3 | 49 | 51  | 0.00005        | 0.00003


Some results, for various values of n, p, k1, k2, are furnished in Table 1.11-2, which uses the 
results in [1-8]. 
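
Interval probabilities of the kind listed in Table 1.11-2 can be checked in the same way. The Python sketch below sums the binomial law exactly over $\alpha \le k \le \beta$ and compares the result with the Normal approximation that uses the half-unit corrections, $F_{SN}\big((\beta - np + 0.5)/\sqrt{npq}\big) - F_{SN}\big((\alpha - np - 0.5)/\sqrt{npq}\big)$. Here $F_{SN}$ is computed from the standard-library `math.erf`; the function names are ours.

```python
import math

def F_SN(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def binomial_interval(k1, k2, n, p):
    """Exact P[k1 <= number of successes <= k2] for n Bernoulli trials."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k1, k2 + 1))

def normal_interval(k1, k2, n, p):
    """Normal approximation to the same probability, with +/- 0.5 corrections."""
    sigma = math.sqrt(n * p * (1 - p))
    return F_SN((k2 - n * p + 0.5) / sigma) - F_SN((k1 - n * p - 0.5) / sigma)

for n, p, a, b in [(200, 0.5, 95, 105), (500, 0.1, 50, 55), (100, 0.3, 27, 29)]:
    print(n, p, a, b, round(binomial_interval(a, b, n, p), 4),
          round(normal_interval(a, b, n, p), 4))
```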

In using the Normal approximation, one should refer to Table 2.4-1. In Table 2.4-1 a
function called erf($x$) is given rather than $F_{SN}(x)$. The erf($x$) is defined by

$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^{x} e^{-\frac{1}{2}y^{2}}\, dy.$$

However, since it is easy to show that

$$F_{SN}(x) = \tfrac{1}{2} + \operatorname{erf}(x), \quad x > 0, \tag{1.11-6}$$

and

$$F_{SN}(x) = \tfrac{1}{2} - \operatorname{erf}(|x|), \quad x < 0, \tag{1.11-7}$$

we can compute Equation 1.11-5 in terms of the table values. Thus, with

$$a' \triangleq \frac{k_1 - np - 0.5}{\sqrt{npq}} \quad \text{and} \quad b' \triangleq \frac{k_2 - np + 0.5}{\sqrt{npq}}$$

and $b' > a'$, we can use the results in Table 1.11-3.
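
When working by computer rather than from tables, note that the erf defined here, $\frac{1}{\sqrt{2\pi}}\int_0^x e^{-y^2/2}\,dy$, is not the error function supplied by most software libraries. Python's `math.erf(x)` is $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$, so the two are related by a change of variable. The sketch below makes the conversion explicit; the names `erf_book` and `F_SN` are ours.

```python
import math

def erf_book(x):
    """The text's erf: (1/sqrt(2*pi)) * integral from 0 to x of exp(-y^2/2) dy."""
    return 0.5 * math.erf(x / math.sqrt(2))

def F_SN(x):
    """Standard Normal CDF assembled from erf_book, split at x = 0."""
    if x >= 0:
        return 0.5 + erf_book(x)
    return 0.5 - erf_book(abs(x))

print(F_SN(1.0), F_SN(-1.0))
```

With this definition erf_book(x) approaches 1/2, not 1, as x grows, which is why the constant 1/2 appears in both branches.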


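The Normal approximation is also useful in evaluating Poisson sums. For example, a
sum such as in Equation 1.10-9 is tedious to evaluate if done directly. However, if $\lambda\tau \gg 1$,
we can use the Normal approximation to the Poisson law, which is merely an extension of the
Normal approximation to the binomial law. This extension is expected since we have seen
that the Poisson law is itself an approximation to the binomial law under certain circumstances.
From the results given above we are able to justify the following approximation:

$$\sum_{k=\alpha}^{\beta} \frac{[\lambda\tau]^{k}}{k!}\, e^{-\lambda\tau} \approx \frac{1}{\sqrt{2\pi}} \int_{l_1}^{l_2} \exp\left(-\tfrac{1}{2}y^{2}\right) dy, \tag{1.11-8}$$

where

$$l_2 \triangleq \frac{\beta - \lambda\tau + 0.5}{\sqrt{\lambda\tau}}$$

and

$$l_1 \triangleq \frac{\alpha - \lambda\tau - 0.5}{\sqrt{\lambda\tau}}.$$

Another useful approximation is

$$\frac{[\lambda\tau]^{k}}{k!}\, e^{-\lambda\tau} \approx \frac{1}{\sqrt{2\pi}} \int_{l_3}^{l_4} \exp\left(-\tfrac{1}{2}y^{2}\right) dy, \tag{1.11-9}$$

where

$$l_4 \triangleq \frac{k - \lambda\tau + 0.5}{\sqrt{\lambda\tau}}$$

and

$$l_3 \triangleq \frac{k - \lambda\tau - 0.5}{\sqrt{\lambda\tau}}.$$

For example, with $\lambda\tau = 5$ and $k = 5$, the error in using the Normal approximation of
Equation 1.11-9 is less than 1 percent.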
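
The closing claim is easy to test. The Python sketch below compares the exact Poisson probability with the Normal integral taken between $(k - \lambda\tau - 0.5)/\sqrt{\lambda\tau}$ and $(k - \lambda\tau + 0.5)/\sqrt{\lambda\tau}$, for $\lambda\tau = 5$ and $k = 5$. The function names are illustrative.

```python
import math

def F_SN(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def poisson_pmf(k, mean):
    """Exact Poisson probability of k points with mean lambda*tau."""
    return mean**k * math.exp(-mean) / math.factorial(k)

def normal_approx_pmf(k, mean):
    """Normal integral over the unit-wide band centered on k."""
    s = math.sqrt(mean)
    return F_SN((k - mean + 0.5) / s) - F_SN((k - mean - 0.5) / s)

exact = poisson_pmf(5, 5.0)
approx = normal_approx_pmf(5, 5.0)
rel_error = abs(approx - exact) / exact
print(exact, approx, rel_error)
```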


SUMMARY 


In this, the first chapter of the book, we have reviewed some different definitions of probability.
We developed the axiomatic theory and showed that for a random experiment three
important objects were required: the sample space $\Omega$, the sigma field of events $\mathcal{F}$, and a
probability measure $P$. The mathematical triple $(\Omega, \mathcal{F}, P)$ is called the probability space $\mathcal{P}$.

We introduced the important notions of independent, dependent, and compound events,
and conditional probability. We developed a number of relations to enable the application
of these.

We discussed some important formulas from combinatorics and briefly illustrated how
important they were in theoretical physics. We then discussed the binomial probability law 
and its generalization, the multinomial law. We saw that the binomial law could, when 
certain limiting conditions were valid, be approximated by the Poisson law. The Poisson 
law, one of the central laws in probability theory, was shown to have application in numerous 
branches of science and engineering. We stated, but deferred verification until Chapter 9, 
that the Poisson law can be derived directly from simple and entirely reasonable physical 
assumptions. 

Approximations for the binomial and Poisson laws, based on the Normal distribu- 
tion, were furnished. Several occupancy problems of engineering interest were discussed. 
In Chapter 4 we shall revisit these problems. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


1.1 In order for a statement such as “Ralph is probably guilty of theft” to have meaning 
in the relative frequency approach to probability, what kind of data would one need? 

1.2 Problems in applied probability (a branch of mathematics called statistics) often
involve testing P → Q (P implies Q) type statements, for example, if she smokes,
she will probably get sick; if he is smart he will do well in school. You are given a 
set of four cards that have a letter on one side and a number on the other. You are 
asked to test the rule “If a card has a D on one side, it has a three on the other.” 
Which of the following cards should you turn over to test the veracity of the rule: 


Card 1 Card 2 Card 3 Card 4 


Be careful here! 

1.3 In a spinning-wheel game, the spinning wheel contains the numbers 1 to 9. The 
contestant wins if an even number shows. What is the probability of a win? What 
are your assumptions? 

1.4 A fair coin is flipped three times. The outcomes on each flip are heads H or tails T. 
What is the probability of obtaining two heads and one tail? 

1.5 An urn contains three balls numbered 1, 2, 3. The experiment consists of drawing a 
ball at random, recording the number, and replacing the ball before the next ball is 
drawn. This is called sampling with replacement. What is the probability of drawing 
the same ball twice in two tries? 

1.6 An experiment consists of drawing two balls without replacement from an urn
containing six balls numbered 1 to 6. Describe the sample space $\Omega$. What is $\Omega$ if
the ball is replaced before the second is drawn?

1.7 The experiment consists of measuring the heights of each partner of a randomly
chosen married couple. (a) Describe $\Omega$ in convenient notation; (b) let $E$ be the event
that the man is shorter than the woman. Describe $E$ in convenient notation.

1.8 An urn contains ten balls numbered 1 to 10. Let $E$ be the event of drawing a ball
numbered no greater than 5. Let $F$ be the event of drawing a ball numbered greater
than 3 but less than 9. Evaluate $E^c$, $F^c$, $EF$, $E \cup F$, $EF^c$, $E^c \cup F^c$, $EF^c \cup E^cF$,
$EF \cup E^cF^c$, $(E \cup F)^c$, and $(EF)^c$. Express these events in words.

1.9 There are four equally likely outcomes $\zeta_1, \zeta_2, \zeta_3$, and $\zeta_4$ and two events $A = \{\zeta_1, \zeta_2\}$
and $B = \{\zeta_2, \zeta_3\}$. Express the sets (events) $AB^c$, $BA^c$, $AB$, and $A \cup B$ in terms of
their elements (outcomes).

1.10 Verify the useful set identities $A = AB \cup AB^c$ and $A \cup B = (AB^c) \cup (BA^c) \cup (AB)$.
Does probability add over these unions? Why?

1.11 In a given random experiment there are four equally likely outcomes $\zeta_1, \zeta_2, \zeta_3$, and $\zeta_4$.
Let the event $A \triangleq \{\zeta_1, \zeta_2\}$. What is the probability of $A$? What is the event (set) $A^c$
in terms of the outcomes? What is the probability of $A^c$? Verify that $P[A] = 1 - P[A^c]$
here.

1.12 Consider the probability space $(\Omega, \mathcal{F}, P)$ for this problem.

(a) State the three axioms of probability theory and explain in a sentence the
significance of each.

(b) Derive the following formula, justifying each step by reference to the appropriate
axiom above,

$$P[A \cup B] = P[A] + P[B] - P[AB],$$

where $A$ and $B$ are arbitrary events in the field $\mathcal{F}$.

1.13 An experiment consists of drawing two balls at random, with replacement from
an urn containing five balls numbered 1 to 5. Three students “Dim,” “Dense,” and
“Smart” were asked to compute the probability $p$ that the sum of numbers appearing
on the two draws equals 5. Dim computed $p = \frac{2}{15}$, arguing that there are 15
distinguishable unordered pairs and only 2 are favorable, that is, (1,4) and (2,3). Dense
computed $p = \frac{1}{9}$, arguing that there are 9 distinguishable sums (2 to 10), of which
only 1 was favorable. Smart computed $p = \frac{4}{25}$, arguing that there were 25
distinguishable ordered outcomes of which 4 were favorable, that is, (4,1), (3,2), (2,3),
and (1,4). Why is $p = \frac{4}{25}$ the correct answer? Explain what is wrong with the
reasoning of Dense and Dim.

1.14 Prove the distributive law for set intersection, that is,

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C),$$

by showing that each side is contained in the other.

1.15 Prove the general result $P[A] = 1 - P[A^c]$ for any probability experiment and any
event $A$ defined on this experiment.

1.16 Let $\Omega = \{1,2,3,4,5,6\}$. Define three events: $A = \{1,2\}$, $B = \{2,3\}$, and $C =
\{4,5,6\}$. The probability measure is unknown, but it satisfies the three axioms.

(a) What is the probability of $A \cap C$?

(b) What is the probability of $A \cup B \cup C$?

(c) State a condition on the probability of either $B$ or $C$ that would allow them
to be independent events.

1.17 Use the axioms given in Equations 1.5-1 to 1.5-3 to show the following: ($E \in \mathcal{F}$,
$F \in \mathcal{F}$) (a) $P[\phi] = 0$; (b) $P[EF^c] = P[E] - P[EF]$; (c) $P[E] = 1 - P[E^c]$.

1.18 Use the probability space $(\Omega, \mathcal{F}, P)$ for this problem. What is the difference between
an outcome, an event, and a field of events?

1.19 Use the axioms of probability to show the following: ($A \in \mathcal{F}$, $B \in \mathcal{F}$):
$P[A \cup B] = P[A] + P[B] - P[A \cap B]$, where $P$ is the probability measure on the sample space
$\Omega$, and $\mathcal{F}$ is the field of events.

1.20 Use the “exclusive-or” operator in Equation 1.4-3 to show that
$P[E \oplus F] = P[EF^c] + P[E^cF]$.

1.21 Show that $P[E \oplus F]$ in the previous problem can be written as
$P[E \oplus F] = P[E] + P[F] - 2P[EF]$.

*1.22 Let the sample space $\Omega$ = {cat, dog, goat, pig}.

(a) Assume that only the following probability information is given:

P[{cat, dog}] = 0.9,
P[{goat, pig}] = 0.1,
P[{pig}] = 0.05,
P[{dog}] = 0.5.

For this given set of probabilities, find the appropriate field of events $\mathcal{F}$
so that the overall probability space $(\Omega, \mathcal{F}, P)$ is well defined. Specify the
field $\mathcal{F}$ by listing all the events in the field, along with their corresponding
probabilities.

(b) Repeat part (a), but without the information that P[{pig}] = 0.05.

1.23 Prove the distributive law for set intersection, that is,

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C),$$

by showing that each side is contained in the other.

1.24 In a given random experiment there are four equally likely outcomes $\zeta_1, \zeta_2, \zeta_3$, and $\zeta_4$.
Let the event $A = \{\zeta_1, \zeta_2\}$. What is the probability of $A$? What is the event (set) $A^c$
in terms of the outcomes? What is the probability of $A^c$? Verify that $P[A] = 1 - P[A^c]$
here.

1.25 An urn contains eight balls. The letters a and b are used to label the balls. Two balls
are labeled a, two are labeled b, and the remaining balls are labeled with both letters,
that is a,b. Except for the labels, all the balls are identical. Now a ball is drawn at
random from the urn. Let A and B represent the events of observing letters a and
b, respectively. Find P[A], P[B], and P[AB]. Are A and B independent? (Note that
you will observe the letter a when you draw either an a ball or an a,b ball.)

1.26 A fair die is tossed twice (a die is said to be fair if all outcomes 1,...,6 are equally
likely). Given that a 3 appears on the first toss, what is the probability of obtaining
the sum 7 after the second toss?

1.27 A card is selected at random from a standard deck of 52 cards. Let A be the event
of selecting an ace and let B be the event of selecting a red card. There are 4 aces
and 26 red cards in the normal deck. Are A and B independent?

1.28 A fair die is tossed three times. Given that a 2 appears on the first toss, what is the
probability of obtaining the sum 7 on the three tosses?

1.29 A random-number generator generates integers from 1 to 9 (inclusive). All outcomes
are equally likely; each integer is generated independently of any previous integer.
Let $\Sigma$ denote the sum of two consecutively generated integers; that is, $\Sigma = N_1 + N_2$.
Given that $\Sigma$ is odd, what is the conditional probability that $\Sigma$ is 7? Given that
$\Sigma > 10$, what is the conditional probability that at least one of the integers is > 7?
Given that $N_1 > 8$, what is the conditional probability that $\Sigma$ will be odd?

1.30 The following problem was given to 60 students and doctors at the famous Hevardi
Medical School (HMS): Assume there exists a test to detect a disease, say D, whose
prevalence is 0.001, that is, the probability, P[D], that a person picked at random is
suffering from D, is 0.001. The test has a false-positive rate of 0.005 and a correct
detection rate of 1. The correct detection rate is the probability that if you have
D, the test will say that you have D. Given that you test positive for D, what is
the probability that you actually have it? Many of the HMS experts answered 0.95
and the average answer was 0.56. Show that your knowledge of probability is greater
than that of the HMS experts by getting the right answer of 0.17.

1.31 Henrietta is 29 years old and physically very fit. In college she majored in geology.
During her student days, she frequently hiked in the national forests and biked in the
national parks. She participated in anti-logging and anti-mining operations. Now,
Henrietta works in an office building in downtown Nirvana. Which is greater: the
probability that Henrietta’s occupation is that of office manager; or the probability
that Henrietta is an office manager who is active in nature-defense organizations like
the Sierra Club?

1.32 In the ternary communication channel shown in Figure P1.32 a 3 is sent three times
more frequently than a 1, and a 2 is sent two times more frequently than a 1. A 1 is
observed; what is the conditional probability that a 1 was sent?

[Channel diagram: inputs X = 1, 2, 3 and outputs Y = 1, 2, 3, with
P[Y = 1 | X = 1] = 1 − α, P[Y = 2 | X = 2] = 1 − β, P[Y = 3 | X = 3] = 1 − γ.]

Figure P1.32 Ternary communication channel.

1.33 A large class in probability theory is taking a multiple-choice test. For a particular
question on the test, the fraction of examinees who know the answer is p; 1 − p is the
fraction that will guess. The probability of answering a question correctly is unity
for an examinee who knows the answer and 1/m for a guessee; m is the number of
multiple-choice alternatives. Compute the probability that an examinee knew the
answer to a question given that he or she has correctly answered it.

1.34 In the beauty-contest problem, Example 1.6-12, what is the probability of picking
the most beautiful contestant if we decide a priori to choose the ith (1 ≤ i ≤ N)
contestant?

1.35 Assume there are three machines A, B, and C in a semiconductor manufacturing
facility that make chips. They manufacture, respectively, 25, 35, and 40 percent
of the total semiconductor chips there. Of their outputs, respectively, 5, 4, and 2
percent of the chips are defective. A chip is drawn randomly from the combined
output of the three machines and is found defective. What is the probability that
this defective chip was manufactured by machine A? by machine B? by machine C?

1.36 In Example 1.6-12, plot the probability of making a correct decision versus α/N,
assuming that the “wait-and-see” strategy is adopted. In particular, what is P[D]
when α/N = 0.5? What does this suggest about the sensitivity of P[D] vis-a-vis α
when α is not too far from $\alpha_0$ and N is large?

1.37 In the village of Madre de la Paz in San Origami, a great flood displaces 103 villagers.
The government builds a temporary tent village of 30 tents and assigns the 103
villagers randomly to the 30 tents.

(a) Identify this problem as an occupancy problem. What are the analogues to
the balls and cells?

(b) How many distinguishable distributions of people in tents can be made?

(c) How many distinguishable distributions are there in which no tent remains
empty?

*1.38 Consider r indistinguishable balls (particles) and n cells (states) where n > r. The
r balls are placed at random into the n cells (multiple occupancy is possible). What
is the probability P that the r balls appear in r preselected cells (one to a cell)?

*1.39 Assume that we have r indistinguishable balls and n cells. The cells can at most
hold only one ball. As in the previous problem r < n. What is the probability P
that the r balls appear in r preselected cells?

*1.40 Three tribal elders win elections to lead the unstable region of North Vatisthisstan.
Five identical assault rifles, a gift of the people of Sodabia, are airdropped among
a meeting of the three leaders. The tribal leaders scamper to collect as many of the
rifles as they each can carry, which is five.

(a) Identify this as an occupancy problem.

(b) List all possible distinguishable distributions of rifles among the three tribal
leaders.

(c) How many distinguishable distributions are there where at least one of the
tribal leaders fails to collect any rifles?

(d) What is the probability that all tribal leaders collect at least one rifle?

(e) What is the probability that exactly one tribal leader will not collect any
rifles?
1.41 In some casinos there is the game Sic bo, in which bettors bet on the outcome of 
a throw of three dice. Many bets are possible each with a different payoff. We list 
some of them below with the associated payoffs in parentheses: 


(a) Specified three of a kind (180 to 1);

(b) Unspecified three of a kind (30 to 1);

(c) Specified two of a kind (10 to 1);

(d) Sum of three dice equals 4 or 17 (60 to 1);

(e) Sum of three dice equals 5 or 16 (30 to 1);

(f) Sum of three dice equals 6 or 15 (17 to 1);

(g) Sum of three dice equals 7 or 14 (12 to 1);

(h) Sum of three dice equals 8 or 13 (8 to 1);

(i) Sum of three dice equals 9, 10, 11, 12 (6 to 1);

(j) Specified two dice combination; that is, of the three dice displayed, two of
them must match exactly the combination wagered (5 to 1).


We wish to compute the associated probabilities of winning from the player’s point 
of view and his expected gain. 

1.42 Most communication networks use packet switching to create virtual circuits between 
two users, even though the users are sharing the same physical channel with others. 
In packet switching, the data stream is broken up into packets that travel different 
paths and are reassembled in the proper chronological order and at the correct 
address. Suppose the order information is missing. Compute the probability that a 
data stream broken up into N packets will reassemble itself correctly, even without 
the order information. 

1.43 In the previous problem assume that N = 3. A lazy engineer decides to omit the order
information in favor of repeatedly sending the data stream until the packets re-order
correctly for the first time. Derive a formula for the probability that the correct
re-ordering occurs for the first time on the nth try. How many repetitions should be
allowed before the cumulative probability of a correct re-ordering for the first time is
at least 0.95?

1.44 Prove that the binomial law b(k;n,p) is a valid probability assignment by showing
that $\sum_{k=0}^{n} b(k;n,p) = 1$.

1.45 War-game strategists make a living by solving problems of the following type. There 
are 6 incoming ballistic missiles (BMs) against which are fired 12 antimissile missiles 
(AMMs). The AMMs are fired so that two AMMs are directed against each BM. 
The single-shot-kill probability (SSKP) of an AMM is 0.8. The SSKP is simply the 
probability that an AMM destroys a BM. Assume that the AMMs don’t interfere
with each other and that an AMM can, at most, destroy only the BM against 
which it is fired. Compute the probability that (a) all BMs are destroyed, (b) at 
least one BM gets through to destroy the target, and (c) exactly one BM gets 
through. 

1.46 Assume in the previous problem that the target was destroyed by the BMs. What
is the conditional probability that only one BM got through?

1.47 A computer chip manufacturer finds that, historically, for every 100 chips produced,
85 meet specifications, 10 need reworking, and 5 need to be discarded. Ten chips are
chosen for inspection.

(a) What is the probability that all 10 meet specs?

(b) What is the probability that 2 or more need to be discarded?

(c) What is the probability that 8 meet specs, 1 needs reworking, and 1 will be
discarded?

1.48 Unlike the city of Nirvana, New York, where 911 is the all-purpose telephone number
for emergencies, in Moscow, Russia, you dial 01 for a fire emergency, 02 for the police,
and 03 for an ambulance. It is estimated that emergency calls in Russia have the
same frequency distribution as in Nirvana, namely, 60 percent are for the police,
25 percent are for ambulance service, and 15 percent are for the fire department.
Assume that 10 calls are monitored and that none of the calls overlap in time and
that the calls constitute independent trials.

1.49 A smuggler, trying to pass himself off as a glass-bead importer, attempts to smuggle
diamonds by mixing diamond beads among glass beads in the proportion of one
diamond bead per 1000 beads. A harried customs inspector examines a sample of
100 beads. What is the probability that the smuggler will be caught, that is, that
there will be at least one diamond bead in the sample?

1.50 Assume that a faulty receiver produces audible clicks to the great annoyance of the
listener. The average number of clicks per second depends on the receiver temperature
and is given by $\lambda(\tau) = 1 - e^{-\tau/10}$, where $\tau$ is time from turn-on. Evaluate the
formula for the probability of 0,1,2,... clicks during the first 10 seconds of operation
after turn-on. Assume the Poisson law.

1.51 A frequently held lottery sells 100 tickets at $1 per ticket every time it is held. One
of the tickets must be a winner. A player has $50 to spend. To maximize the probability
of winning at least one lottery, should he buy 50 tickets in one lottery or one
ticket in 50 lotteries?

1.52 In the previous problem, which of the two strategies will lead to a greater expected
gain for the player? The expected gain if $M$ ($M \le 50$) lotteries are played is defined
as $G_M \triangleq \sum_i G_i P(i)$, where $G_i$ is the gain obtained in winning $i$ lotteries.

1.53 The switch network shown in Figure P1.53 represents a digital communication link.
Switches $a_i$, $i = 1, \ldots, 6$, are open or closed and operate independently. The probability
that a switch is closed is $p$. Let $A_i$ represent the event that switch $i$ is closed.

[Diagram: switches $a_1$ to $a_6$ connecting terminals 1 and 2.]

Figure P1.53 Switches in telephone link.

(a) In terms of the $A_i$’s write the event that there exists at least one closed path
from 1 to 2.

(b) Compute the probability of there being at least one closed path from 1
to 2.

1.54 (independence of events in disjoint intervals for Poisson law) The average number
of cars arriving at a tollbooth per minute is $\lambda$ and the probability of $k$ cars in the
interval $(0, T)$ minutes is

$$P(k; 0, T) = e^{-\lambda T}\, \frac{[\lambda T]^{k}}{k!}.$$

Consider two disjoint, that is, nonoverlapping, intervals, say $(0, t_1]$ and $(t_1, T]$. Then
for the Poisson law:

$$P[n_1 \text{ cars in } (0, t_1] \text{ and } n_2 \text{ cars in } (t_1, T]] \tag{1.11-10}$$

$$= P[n_1 \text{ cars in } (0, t_1]]\; P[n_2 \text{ cars in } (t_1, T]], \tag{1.11-11}$$

that is, events in disjoint intervals are independent. Using this fact, show the following:

(a) That $P[n_1 \text{ cars in } (0, t_1] \mid n_1 + n_2 \text{ cars in } (0, T]]$ is not a function of $\lambda$.

(b) In (a) let $T = 2$, $t_1 = 1$, $n_1 = 5$, and $n_2 = 5$. Compute
$P[5 \text{ cars in } (0, 1] \mid 10 \text{ cars in } (0, 2]]$.

1.55 An automatic breathing apparatus (B) used in anesthesia fails with probability $P_B$.
A failure means death to the patient unless a monitor system (M) detects the failure
and alerts the physician. The monitor system fails with probability $P_M$. The failures
of the system components are independent events. Professor X, an M.D. at
Hevardi Medical School, argues that if $P_M > P_B$ installation of M is useless.†
Show that Prof. X needs to take a course on probability theory by computing the
probability of a patient dying with and without the monitor system in place. Take
$P_M = 0.1 = 2P_B$.
In a particular communication network, the server broadcasts a packet of data 
(say, L bytes long) to N receivers. The server then waits to receive an acknowl- 
edgment message from each of the N receivers before proceeding to broadcast the 
next packet. If the server does not receive all the acknowledgments within a certain 
time period, it will rebroadcast (retransmit) the same packet. The server is then 
said to be in the “retransmission mode.” It will continue retransmitting the packet 
until all N acknowledgments are received. Then it will proceed to broadcast the
next packet. Let $p \triangleq P[\text{successful transmission of a single packet to a single receiver
along with successful acknowledgment}]$. Assume that these events are independent
for different receivers or separate transmission attempts. Due to random impairments
in the transmission media and the variable condition of the receivers, we
have that $p < 1$.


†A true story! The name of the medical school has been changed.
(a) In a fixed protocol or method of operation, we require that all N of the
acknowledgments be received in response to a given transmission attempt for
that packet transmission to be declared successful. Let the event S(m) be
defined as follows: $S(m) \triangleq \{$a successful transmission of one packet to all N
receivers in m or fewer attempts$\}$. Find the probability

$$P(m) \triangleq P[S(m)].$$

[Hint: Consider the complement of the event S(m).]

(b) An improved system operates according to a dynamic protocol as follows. 
Here we relax the acknowledgment requirement on retransmission attempts, 
so as to only require acknowledgments from those receivers that have not yet 
been heard from on previous attempts to transmit the current packet. Let 
$S_D(m)$ be the same event as in part (a) but using the dynamic protocol. Find
the probability

$$P_D(m) \triangleq P[S_D(m)].$$

[Hint: First consider the probability of the event $S_D(m)$ for an individual
receiver, and then generalize to the N receivers.]

Note: If you try p = 0.9 and N = 5 you should find that $P(2) < P_D(2)$.

Toss two unbiased dice (each with six faces: 1 to 6), and write down the sum of 
the two face numbers. Repeat this procedure 100 times. What is the probability of 
getting 10 readings of value 7? What is the Poisson approximation for computing 
this probability? (Hint: Consider the event A = {sum = 7} on a single toss and let 
p in Equation 1.9-1 be P[A].) 

On behalf of your tenants you have to provide a laundry facility. Your choices 
are 


1. lease two inexpensive “Clogger” machines at $50.00/month each; or 
2. lease a single “NeverFail” at $100/month. 


The Clogger is out of commission 40 percent of the time while the NeverFail is out 
of commission only 20 percent of the time. 


(a) From the tenant's point of view, which is the better alternative?
(b) From your point of view as landlord, which is the better alternative? 


In the politically unstable country of Eastern Borduria, it is not uncommon to find 
a bomb onboard passenger aircraft. The probability that on any given flight, a bomb 
will be onboard is $10^{-2}$. A nervous passenger always flies with an unarmed bomb
in his suitcase, reasoning that the probability of there being two bombs onboard is
$10^{-4}$. By this maneuver, the nervous passenger believes that he has greatly reduced
the airplane’s chances of being blown up. Do you agree with his reasoning? If not, 
why not? 

In a ring network consisting of eight links as shown in Figure P1.60, there are 
two paths connecting any two terminals. Assume that links fail independently with 
probability $q$, $0 < q < 1$. Find the probability of successful transmission of a packet
from terminal A to terminal B. (Note: Terminal A transmits the packet in both
directions on the ring. Also, terminal B removes the packet from the ring upon 
reception. Successful transmission means that terminal B received the packet from 
either direction.) 


Figure P1.60 A ring network with eight stations.


A union directive to the executives of the telephone company demands that tele- 
phone operators receive overtime payment if they handle more than 5760 calls in an 
eight-hour day. What is the probability that Curtis, a unionized telephone operator, 
will collect overtime on a particular day where the occurrence of calls during the 
eight-hour day follows the Poisson law with rate parameter $\lambda = 720$ calls/hour?
Toss two unbiased coins (each with two sides: numbered 1 and 2), and write down 
the sum of the two side numbers. Repeat this procedure 80 times. What is the prob- 
ability of getting 10 readings of value 2? What is the Poisson approximation for 
computing this probability? 

The average number of cars arriving at a tollbooth is $\lambda$ cars per minute and the
probability of cars arriving is assumed to follow the Poisson law. Given that five 
cars arrive in the first two minutes, what is the probability of 10 cars arriving in the 
first four minutes? 

An aging professor, desperate to finally get a good review for his course on proba- 
bility, hands out chocolates to his students. The professor’s short-term memory is 
so bad that he can’t remember which students have already received a chocolate. 
Assume that, for all intents and purposes, the chocolates are distributed randomly. 
There are 10 students and 15 chocolates. What is the probability that each student 
received at least one chocolate? 

Assume that code errors in a computer program occur as follows: A line of code 
contains errors with probability p = 0.001 and is error free with probability q =
0.999. Also, errors in different lines occur independently. In a 1000-line program,
what is the approximate probability of finding 2 or more erroneous lines? 

Let us assume that two people have their birthdays on the same day if both the 
month and the day are the same for each (not necessarily the year). How many 
people would you need to have in a room before the probability is 1/2 or greater that
at least two people have their birthdays on the same day? 
(sampling) We draw ten chips at random from a semiconductor manufacturing line
that is known to have a defect rate of 2 percent. Find the probability that more 
than one of the chips in our sample is defective. 

(percolating fractals) Consider a square lattice with $N^2$ cells, that is, N cells per side.
Write a program that does the following: With probability p you put an electrically
conducting element in a cell and with probability $q = 1 - p$, you leave the cell empty.
Do this for every cell in the lattice. When you are done, does there exist a continuous
path for current to flow from the bottom of the lattice to the top? If yes, the lattice
is said to percolate. Percolation models are used in the study of epidemics, the spread of
forest fires, ad hoc networks, and so on. The lattice is called a random fractal because
of certain invariant properties that it possesses. Try N = 10, 20, 50; p = 0.1, 0.3, 0.6.
You will need a random number generator. MATLAB has the function rand, which
generates uniformly distributed random numbers $x_i$ in the interval (0.0, 1.0). If the
number $x_i < p$, make the cell electrically conducting; otherwise leave it alone. Repeat
the procedure as often as time permits in order to estimate the probability of perco- 
lation for different p’s. A nonpercolating lattice is shown in Figure P1.68(a); a perco- 
lating lattice is shown in (b). For more discussion of this problem, see M. Schroeder, 
Fractals, Chaos, Power Laws (New York: W.H. Freeman, 1991). 
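
One possible way to organize such a program is sketched below in Python rather than MATLAB. The function names, the number of trials, the seed, and in particular the choice of four-neighbor connectivity (current flows only between cells that share an edge) are assumptions made for illustration; the problem statement does not fix them.

```python
import random
from collections import deque


def random_lattice(n, p, rng):
    """Return an n-by-n grid; a cell is True (conducting) with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def percolates(grid):
    """Breadth-first search from conducting bottom-row cells toward the top row."""
    n = len(grid)
    bottom = n - 1
    queue = deque((bottom, c) for c in range(n) if grid[bottom][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == 0:
            return True  # a conducting path has reached the top row
        # Assumption: only edge-sharing neighbors conduct (no diagonals).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and grid[rr][cc] and (rr, cc) not in seen:
                seen.add((rr, cc))
                queue.append((rr, cc))
    return False


def estimate_percolation(n, p, trials=200, seed=0):
    """Fraction of independently generated lattices that percolate."""
    rng = random.Random(seed)
    hits = sum(percolates(random_lattice(n, p, rng)) for _ in range(trials))
    return hits / trials
```

Calling, for example, `estimate_percolation(20, 0.6)` returns a relative frequency that serves as the estimate of the percolation probability for that N and p; more trials give a less noisy estimate.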

You are a contestant on a TV game show. There are three identical closed doors 
leading to three rooms. Two of the rooms contain nothing, but the third contains 
a $100,000 Rexus luxury automobile which is yours if you pick the right door. You 
are asked to pick a door by the master of ceremonies (MC) who knows which room 
contains the Rexus. After you pick a door, the MC opens a door (not the one you 
picked) to show a room not containing the Rexus. Show that even without any 
further knowledge, you will greatly increase your chances of winning the Rexus if 
you switch your choice from the door you originally picked to the one remaining 
closed door. 
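
A simulation does not replace the requested argument, but it can serve as a sanity check on it. The sketch below assumes that the car is placed uniformly at random and that, when the MC has two empty rooms to choose from, he picks one of them at random; the function names, trial count, and seed are illustrative.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The MC opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=1):
    """Relative frequency of wins over many independent rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials
```

Comparing `win_rate(True)` with `win_rate(False)` shows the switching strategy winning roughly twice as often as staying.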

Often we are faced with determining the more likely of two alternatives. In such a
case we are given two probability measures for a single sample space and field of
events, that is, $(\Omega, \mathcal{F}, P_1)$ and $(\Omega, \mathcal{F}, P_2)$, and we are asked to determine the probability
of an observed event E in both cases. The more likely alternative is said to
be the one which gives the higher probability of event E.

Consider that two coins are in a box; one is “fair” with $P_1[\{H\}] = 0.5$ and one is
“biased” with $P_2[\{H\}] = p$. Without looking, we draw one coin from the box and
then flip this single coin ten times. We only consider the repeated coin-flips as our
experiment and so the sample space $\Omega$ = {all ten-character strings of H and T}.
We observe the event E = {a total of four H’s and six T’s}.


(a) What are the two probabilities of the observed event E, that is, $P_1[E]$ and
$P_2[E]$?

(b) Determine the likelihood ratio $L \triangleq P_1[E]/P_2[E]$ as a function of p. (When
L > 1, we say that the fair coin is more likely. This test is called a likelihood
ratio test.)
2 Random Variables


2.1 INTRODUCTION 


Many random phenomena have outcomes that are sets of real numbers: the voltage v(t),
at time t, across a noisy resistor, the arrival time of the next customer at a movie theatre,
the number of photons in a light pulse, the brightness level at a particular point on the TV 
screen, the number of times a light bulb will switch on before failing, the lifetime of a given 
living person, the number of people on a New York to Chicago train, and so forth. In all 
these cases the sample spaces are sets of numbers on the real line. 

Even when a sample space $\Omega$ is not numerical, we might want to generate a new sample
space from $\Omega$ that is numerical, that is, converting random speech, color, gray tone, and so
forth to numbers, or converting the physical fitness profile of a person chosen at random 
into a numerical “fitness” vector consisting of weight, height, blood pressure, heart rate, 
and so on, or describing the condition of a patient afflicted with, say, black lung disease by 
a vector whose components are the number and size of lung lesions and the number of lung 
zones affected. 

In science and engineering, we are in almost all instances interested in numerical 
outcomes, whether the underlying experiment $\mathcal{H}$ is numerical-valued or not. To obtain
numerical outcomes, we need a rule or mapping from the original sample space $\Omega$ to the
real line $R^1$. Such a mapping is what a random variable fundamentally is and we discuss it
in some detail in the next several sections. 

Let us, however, make a remark or two. The concept of a random variable will enable 
us to replace the original probability space with one in which events are sets of numbers. 
Thus, on the induced probability space of a random variable every event is a subset of $R^1$.
But is every subset of $R^1$ always an event? Are there subsets of $R^1$ that could get us into
trouble by violating the axioms of probability? The answer is yes, but fortunately these
subsets are not of engineering or scientific importance. We say that they are nonmeasurable.†
Sets of practical importance are of the form $\{x = a\}$, $\{x: a \le x \le b\}$, $\{x: a < x \le b\}$,
$\{x: a \le x < b\}$, $\{x: a < x < b\}$, and their unions and intersections. These five intervals are
more easily denoted $[a]$, $[a, b]$, $(a, b]$, $[a, b)$, and $(a, b)$. Intervals that include the end points
are said to be closed; those that leave out end points are said to be open. Intervals can also
be half-closed (half-open); for example, the interval $(a, b]$ is open on the left and closed
on the right. The field of subsets of $R^1$ generated by the intervals was called the Borel field
in Chapter 1, Section 4.

We can define more than one random variable on the same underlying sample space
$\Omega$. For example, suppose that $\Omega$ consists of a large, representative group of people in the
United States. Let the experiment consist of choosing a person at random. Let X denote 
the person’s lifetime and Y denote that person’s daily consumption of cigarettes. We can 
now ask: Are X and Y related? That is, can we predict X from observing Y? Suppose 
we define a third random variable Z that denotes the person’s weight. Is Z related to X 
or Y? 

The main advantage of dealing with random variables is that we can define certain 
probability functions that make it both convenient and easy to compute the probabilities 
of various events. These functions must naturally be consistent with the axiomatic theory. 
For this reason we must be a little careful in defining events on the real line. Elaboration 
of the ideas introduced in this section is given next. 


2.2 DEFINITION OF A RANDOM VARIABLE 


Consider an experiment $\mathcal{H}$ with sample space $\Omega$. The elements or points of $\Omega$, $\zeta$, are the
random outcomes of $\mathcal{H}$. If to every $\zeta$ we assign a real number $X(\zeta)$, we establish a correspondence
rule between $\zeta$ and $R^1$, the real line. Such a rule, subject to certain constraints,
is called a random variable, abbreviated as RV. Thus, a random variable $X(\cdot)$ or simply X
is not really a variable but a function whose domain is $\Omega$ and whose range is some subset
of the real line. Being a function, X generates for every $\zeta$ a specific $X(\zeta)$ although for a
particular $X(\zeta)$ there may be more than one outcome $\zeta$ that produced it. Now consider an
event $E_B \subset \Omega$ ($E_B \in \mathcal{F}$).

Through the mapping X, such an event maps into points on the real line (Figure 2.2-1).
In particular, the event $\{\zeta: X(\zeta) \le x\}$, often abbreviated $\{X \le x\}$, will denote an event of
unique importance, and we should like to assign a probability to it. As a function of the real
variable x, the probability $P[X \le x] \triangleq F_X(x)$ is called the cumulative distribution function
(CDF) of X. It is shown in more advanced books [2-1] and [2-2] that in order for $F_X(x)$


†See Appendix D for a brief discussion on measure.
Figure 2.2-1 Symbolic representation of the action of the random variable X.


to be consistent with the axiomatic definition of probability, the function X must satisfy
the following: For every Borel set of numbers B, the set $\{\zeta: X(\zeta) \in B\}$ must correspond to
an event $E_B \in \mathcal{F}$; that is, it must be in the domain of the probability measure P. Stated
somewhat more mathematically, this requirement demands that X can be a random variable
only if the inverse images under X of all Borel subsets in $R^1$, making up the field $\mathcal{B}$,† are
events. What is an inverse image? Consider an arbitrary Borel set of real numbers B; the set
of points $E_B$ in $\Omega$ for which $X(\zeta)$ assumes values in B is called the inverse image of the set
B under the mapping X. Finally, all sets of engineering interest can be written as countable
unions or intersections of events of the form $(-\infty, x]$. The event $\{\zeta: X(\zeta) \le x\} \in \mathcal{F}$ gets
mapped under X into $(-\infty, x] \in \mathcal{B}$. Thus, if X is a random variable, the set of points
$(-\infty, x]$ is an event.
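
On a finite sample space the inverse-image requirement can be checked by brute force. The following Python sketch uses a hypothetical example that is not from the text: a die is rolled, but the field of events contains only the parity events. The helper names `inverse_image` and `is_random_variable` are illustrative. Because the range of X is finite, it is enough to test every subset of the range rather than every Borel set.

```python
from itertools import chain, combinations


def inverse_image(X, B):
    """Outcomes zeta in the domain of X for which X(zeta) lies in B."""
    return frozenset(z for z, value in X.items() if value in B)


def is_random_variable(X, field):
    """True if the inverse image of every subset of X's range is an event."""
    values = sorted(set(X.values()))
    subsets = chain.from_iterable(
        combinations(values, k) for k in range(len(values) + 1)
    )
    return all(inverse_image(X, set(B)) in field for B in subsets)


# Hypothetical example: one roll of a die where only parity is observable.
omega = frozenset(range(1, 7))
evens, odds = frozenset({2, 4, 6}), frozenset({1, 3, 5})
field = {frozenset(), evens, odds, omega}

parity = {z: z % 2 for z in omega}  # inverse images are parity events
face = {z: z for z in omega}        # {face = 2} is not in the field
```

Here `is_random_variable(parity, field)` holds, while `is_random_variable(face, field)` fails because the inverse image of a single face value is not one of the four events.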

In many if not most scientific and engineering applications, we are not interested in 
the actual form of X or the specification of the set $\Omega$. For example, we might conceive of
an underlying experiment that consists of heating a resistor and observing the positions
and velocities of the electrons in the resistor. The set $\Omega$ is then the totality of positions
and velocities of all N electrons present in the resistor. Let X be the thermal noise current
produced by the resistor; clearly $X: \Omega \to R^1$ although the form of X, that is, the exceedingly
complicated equations of quantum electrodynamics that map from electron positions and
velocity configurations to current, is not specified. What we are really interested in is the
behavior of X. Thus, although an underlying experiment with sample space $\Omega$ may be
implied, it is the real line $R^1$ and its subsets that will hold our interest and figure in our
computations. Under the mapping X we have, in effect, generated a new probability space
$(R^1, \mathcal{B}, P_X)$, where $R^1$ is the real line, $\mathcal{B}$ is the Borel $\sigma$-field of all subsets of $R^1$ generated


†The $\sigma$-field of events defined on $\Omega$ is denoted by $\mathcal{F}$. The family of Borel subsets of points on $R^1$ is
denoted by $\mathcal{B}$. For definitions, see Section 1.4 in Chapter 1.
by all the unions, intersections, and complements of the semi-infinite intervals $(-\infty, x]$, and
$P_X$ is a set function assigning a number $P_X[A] \ge 0$ to each set $A \in \mathcal{B}$.†

In order to assign certain desirable continuity properties to the function $F_X(x)$ at
$x = \pm\infty$, we require that the events $\{X = \infty\}$ and $\{X = -\infty\}$ have probability zero. With
the latter our specification of a random variable is complete, and we can summarize much 
of the above discussion in the following definition. 


Definition 2.2-1 Let $\mathcal{H}$ be an experiment with sample space $\Omega$. Then the real
random variable X is a function whose domain is $\Omega$ that satisfies the following: (i) For
every Borel set of numbers B, the set $E_B \triangleq \{\zeta \in \Omega, X(\zeta) \in B\}$ is an event and (ii)
$P[X = -\infty] = P[X = +\infty] = 0$.

Loosely speaking, when the range of X consists of a countable set of points, X is said 
to be a discrete random variable; and if the range of X is a continuum, X is said to be 
continuous. This is a somewhat inadequate definition of discrete and continuous random 
variables for the simple reason that we often like to take for the range of X the whole 
real line $R^1$. Points in $R^1$ not actually reached by the transformation X with a nonzero
probability are then associated with the impossible event. 


Example 2.2-1 
(random person) A person, chosen at random off the street, is asked if he or she has a
younger brother. If the answer is no, the data is encoded by random variable X as zero; if
the answer is yes, the data is encoded as one. The underlying experiment has sample space
$\Omega = \{\text{no}, \text{yes}\}$, sigma field $\mathcal{F} = [\phi, \Omega, \{\text{no}\}, \{\text{yes}\}]$, and probabilities $P[\phi] = 0$, $P[\Omega] = 1$,
$P[\text{no}] = \frac{3}{4}$ (an assumption), $P[\text{yes}] = \frac{1}{4}$. The associated probabilities for X are $P[\phi] = 0$,
$P[X < \infty] = P[\Omega] = 1$, $P[X = 0] = P[\text{no}] = \frac{3}{4}$, $P[X = 1] = P[\text{yes}] = \frac{1}{4}$. Take any $x_1$, $x_2$
and consider, for example, the probabilities that X lies in sets of the type $[x_1, x_2]$, $[x_1, x_2)$,
or $(x_1, x_2]$. Thus,

$$P[3 \le X \le 4] = P[\phi] = 0$$
$$P[0 \le X < 1] = P[\text{no}] = \tfrac{3}{4}$$
$$P[0 \le X \le 2] = P[\Omega] = 1$$
$$P[0 < X \le 1] = P[\text{yes}] = \tfrac{1}{4},$$

and so on. Thus, every set $\{X = x\}$, $\{x_1 \le X < x_2\}$, $\{X \le x_2\}$, and so forth is related to
an event defined on $\Omega$. Hence X is a random variable.
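
The bookkeeping in this example can be mirrored in a few lines of Python. The sketch below assumes the values P[no] = 3/4 and P[yes] = 1/4 for the example's probabilities; the dictionary layout and the helper name `prob` are illustrative only.

```python
from fractions import Fraction

# Probabilities on the underlying sample space (assumed values for the example).
P = {"no": Fraction(3, 4), "yes": Fraction(1, 4)}
# The random variable X encodes each outcome as a number.
X = {"no": 0, "yes": 1}


def prob(predicate):
    """P[{zeta : predicate(X(zeta))}], summed over the inverse image."""
    return sum((P[z] for z in P if predicate(X[z])), Fraction(0))
```

For instance, `prob(lambda x: 0 <= x < 1)` sums the probability of every outcome that X maps into [0, 1), which here is the single outcome "no".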


†The extraordinary advantage of dealing with random variables is that a single pointwise function, that
is, the cumulative distribution function $F_X(x)$, can replace the set function $P_X[\cdot]$ that may be extremely
cumbersome to specify, since it must be specified for every event (set) $A \in \mathcal{B}$. See Section 2.3.

*An alternative definition is the following: X is discrete if $F_X(x)$ is a staircase-type function, and X
is continuous if $F_X(x)$ is a continuous function. Some random variables cannot be classified as discrete or
continuous; they are discussed in Section 2.5.
Example 2.2-2
(random bus arrival time) A bus arrives at random in $[0, T]$; let t denote the time of arrival.
The sample space $\Omega$ is $\Omega = \{t: t \in [0, T]\}$. A random variable X is defined by

$$X(t) = \begin{cases} 1, & t \in \left(\dfrac{T}{4}, \dfrac{T}{2}\right), \\ 0, & \text{otherwise.} \end{cases}$$

Assume that the arrival time is uniform over $[0, T]$. We can now ask and compute what is
$P[X(t) = 1]$ or $P[X(t) = 0]$ or $P[X(t) \le 5]$.


Example 2.2-3 
(drawing from urn) An urn contains three colored balls. The balls are colored white (W), 
black (B), and red (R), respectively. The experiment consists of choosing a ball at random 
from the urn. The sample space is $\Omega = \{W, B, R\}$. The random variable X is
defined by

$$X(\zeta) = \begin{cases} \pi, & \zeta = W \text{ or } B, \\ 0, & \zeta = R. \end{cases}$$

We can ask and compute the probability $P[X \le x_1]$, where $x_1$ is any number. Thus,
$\{X \le 0\} = \{R\}$, $\{2 \le X < 4\} = \{W, B\}$. The computation of the associated probabilities
is left as an exercise.


Example 2.2-4 
(wheel of chance) A spinning wheel and pointer has 50 sectors numbered n = 0,1,...,49. 
The experiment consists of spinning the wheel. Because the players are interested only in 
even or odd outcomes, they choose $\Omega = \{\text{even}, \text{odd}\}$ and the only events in the $\sigma$-field
are $\{\phi, \Omega, \text{even}, \text{odd}\}$. Let X = n, that is, if n shows up, X assumes that value. Is X a
random variable? Note that the inverse image of the set {2, 3} is not an event. Hence
X is not a valid random variable on this probability space because it is not a function
on $\Omega$.


2.3 CUMULATIVE DISTRIBUTION FUNCTION 


In Example 2.2-1 the induced event space under X includes {0, 1}, {0}, {1}, $\phi$, for which
the probabilities are P[X = 0 or 1] = 1, $P[X = 0] = \frac{3}{4}$, $P[X = 1] = \frac{1}{4}$, and $P[\phi] = 0$. From
these probabilities, we can infer any other probabilities such as $P[X \le 0.5]$. In many cases
it is awkward to write down $P[\cdot]$ for every event. For this reason we introduce a pointwise
probability function called the cumulative distribution function (CDF). The CDF is a function
of x, which contains all the information necessary to compute P[E] for any E in the Borel
field of events. The CDF, $F_X(x)$, is defined by


$$F_X(x) = P[\{\zeta: X(\zeta) \le x\}] = P_X[(-\infty, x]].$$ (2.3-1)
The event in Equation 2.3-1 is read as “the set of all outcomes $\zeta$ in the underlying sample space such
that the function $X(\zeta)$ assumes values less than or equal to x.” Thus, there is a subset of
outcomes $\{\zeta: X(\zeta) \le x\} \subset \Omega$ that the mapping $X(\cdot)$ generates as the set $[-\infty, x] \subset R^1$.
The sets $\{\zeta: X(\zeta) \le x\} \subset \Omega$ and $[-\infty, x] \subset R^1$ are equivalent events. We shall frequently
leave out the dependence on the underlying sample space and write merely $P[X \le x]$ or
$P[a < X \le b]$.

For the present we shall denote random variables by capital letters, that is, X, Y, Z, 
and the values they can take by lowercase letters x, y, z. The subscript X on $F_X(x)$ associates
it with the random variable for which it is the CDF. Thus, $F_X(y)$ means the CDF
of random variable X evaluated at the real number y and thus equals the probability
$P[X \le y]$. If $F_X(x)$ is discontinuous at a point, say, $x_o$, then $F_X(x_o)$ will be taken to
mean the value of the CDF immediately to the right of $x_o$ (we call this continuity from the
right).


Properties† of $F_X(x)$
(i) $F_X(\infty) = 1$, $F_X(-\infty) = 0$.
(ii) $x_1 \le x_2 \Rightarrow F_X(x_1) \le F_X(x_2)$, that is, $F_X(x)$ is a nondecreasing function of x.
(iii) $F_X(x)$ is continuous from the right, that is,


$$F_X(x) = \lim_{\varepsilon \to 0} F_X(x + \varepsilon), \quad \varepsilon > 0.$$


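Proof of (ii) Consider the event $\{x_1 < X \le x_2\}$ with $x_2 > x_1$. The set $(x_1, x_2]$ is
nonempty and $\in \mathcal{B}$. Hence
$$0 \le P[x_1 < X \le x_2] \le 1.$$

But
$$\{X \le x_2\} = \{X \le x_1\} \cup \{x_1 < X \le x_2\}$$
and
$$\{X \le x_1\} \cap \{x_1 < X \le x_2\} = \phi.$$
Hence
$$F_X(x_2) = F_X(x_1) + P[x_1 < X \le x_2]$$
or


$$P[x_1 < X \le x_2] = F_X(x_2) - F_X(x_1) \ge 0 \quad \text{for } x_2 > x_1.$$ (2.3-2)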


We leave it to the reader to establish the following results: 


$$P[a \le X \le b] = F_X(b) - F_X(a) + P[X = a];$$
$$P[a < X < b] = F_X(b) - P[X = b] - F_X(a);$$
$$P[a \le X < b] = F_X(b) - P[X = b] - F_X(a) + P[X = a].$$
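
As a sketch of how such results follow, the closed-interval case can be obtained by splitting the event into a point and a half-open interval, which are disjoint, and then applying Equation 2.3-2 (here a < b is assumed):

```latex
\begin{aligned}
P[a \le X \le b] &= P[\{X = a\} \cup \{a < X \le b\}] \\
                 &= P[X = a] + P[a < X \le b] \\
                 &= P[X = a] + F_X(b) - F_X(a).
\end{aligned}
```

The other two results follow in the same way by removing the point {X = b} from the half-open interval.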


†Properties (i) and (iii) require proof. This is furnished with the help of extended axioms in Chapter 8.
Also see Davenport [2-3, Chapter 4]. 
Example 2.3-1
(parity bits) The experiment consists of observing the voltage X of the parity bit in a word
in computer memory. If the bit is on, then X = 1; if off then X = 0. Assume that the off
state has probability q and the on state has probability 1 − q. The sample space has only
two points: $\Omega = \{\text{off}, \text{on}\}$.


Computation of $F_X(x)$


(i) $x < 0$: The event $\{X \le x\} = \phi$ and $F_X(x) = 0$.
(ii) $0 \le x < 1$: The event $\{X \le x\}$ is equivalent to the event {off} and excludes the
event {on}. Hence $F_X(x) = q$.
(iii) $x \ge 1$: The event $\{X \le x\} = \Omega$ is the certain event since both outcomes, off and on,
satisfy $X \le x$. Hence $F_X(x) = 1$.


The solution is shown in Figure 2.3-1.
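
The three cases translate directly into a piecewise function. The Python sketch below is only an illustration of the computation above; the function name `parity_cdf` is hypothetical.

```python
def parity_cdf(x, q):
    """F_X(x) for the parity-bit voltage: X = 0 w.p. q and X = 1 w.p. 1 - q."""
    if x < 0:
        return 0.0  # {X <= x} is the impossible event
    if x < 1:
        return q    # {X <= x} is equivalent to {off}
    return 1.0      # {X <= x} is the certain event
```

Evaluating the function at x = 0 and x = 1 returns the value just to the right of each jump, in line with the right-continuity convention adopted for the CDF.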


Example 2.3-2 
(waiting for a bus) A bus arrives at random in $(0, T]$. Let the random variable X denote
the time of arrival. Then clearly $F_X(t) = 0$ for $t \le 0$ and $F_X(T) = 1$ because the former
is the probability of the impossible event while the latter is the probability of the certain
event. Suppose it is known that the bus is equally likely or uniformly likely to come at any
time within $(0, T]$. Then


$$F_X(t) = \begin{cases} 0, & t \le 0, \\ \dfrac{t}{T}, & 0 < t \le T, \\ 1, & t > T. \end{cases}$$ (2.3-3)
Figure 2.3-1 Cumulative distribution function associated with the parity bit observation experiment.
Figure 2.3-2 Cumulative distribution function of the uniform random variable X of Example 2.3-2.


Actually Equation 2.3-3 defines “equally likely,” not the other way around. The CDF is 
shown in Figure 2.3-2. In this case we say that X is uniformly distributed. 


If $F_X(x)$ is a continuous function of x, then
$$F_X(x) = F_X(x^-).$$ (2.3-4)
However, if $F_X(x)$ is discontinuous at the point x, then, from Equation 2.3-2,


$$F_X(x) - F_X(x^-) = P[x^- < X \le x] = \lim_{\varepsilon \to 0} P[x - \varepsilon < X \le x] \triangleq P[X = x].$$ (2.3-5)


Typically $P[X = x]$ is a discontinuous function of x; it is zero whenever $F_X(x)$ is continuous
and nonzero only at discontinuities in $F_X(x)$.


Example 2.3-3 
(binomial distribution function) Compute the CDF for a binomial random variable X with 
parameters (n, p). 


Solution Since X takes on only discrete values, that is, $X \in \{0, 1, 2, \ldots, n\}$, the event
$\{X \le x\}$ is the same as $\{X \le [x]\}$, where $[x]$ is the largest integer equal to or smaller
than x. Then $F_X(x)$ is given by the stepwise constant function


$$F_X(x) = \sum_{j=0}^{[x]} \binom{n}{j} p^j (1 - p)^{n-j}.$$


For p = 0.6, n = 4, the CDF has the appearance of a staircase function as shown in 
Figure 2.3-3. 
Figure 2.3-3 Cumulative distribution function for a binomial RV with n = 4, p = 0.6.


Example 2.3-4 
(computing binomial probabilities) Using the results of Example 2.3-3, compute the following: 


(a) $P[1.5 < X < 3]$;
(b) $P[0 \le X \le 3]$;
(c) $P[1.2 < X \le 1.8]$;
(d) $P[1.99 \le X < 3]$.


(a) $P[1.5 < X < 3] = F_X(3) - P[X = 3] - F_X(1.5)$

= 0.8704 − 0.3456 − 0.1792 = 0.3456;

(b) $P[0 \le X \le 3] = F_X(3) - F_X(0) + P[X = 0]$

= 0.8704 − 0.0256 + 0.0256 = 0.8704;

(c) $P[1.2 < X \le 1.8] = F_X(1.8) - F_X(1.2)$

= 0.1792 − 0.1792 = 0;

(d) $P[1.99 \le X < 3] = F_X(3) - P[X = 3] - F_X(1.99) + P[X = 1.99]$
= 0.8704 − 0.3456 − 0.1792 + 0 = 0.3456.
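
These four calculations can be cross-checked numerically. The Python sketch below builds the binomial PMF and CDF for n = 4, p = 0.6 and applies the same interval formulas; the helper names `pmf` and `cdf` are illustrative and not part of the text.

```python
import math


def pmf(k, n=4, p=0.6):
    """P[X = k] for a binomial RV; zero unless k is an integer in 0..n."""
    if k != int(k) or not 0 <= k <= n:
        return 0.0
    k = int(k)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(x, n=4, p=0.6):
    """F_X(x): sum of the PMF over integers j with 0 <= j <= floor(x)."""
    return sum(pmf(j, n, p) for j in range(0, min(n, math.floor(x)) + 1))


a = cdf(3) - pmf(3) - cdf(1.5)               # P[1.5 < X < 3]
b = cdf(3) - cdf(0) + pmf(0)                 # P[0 <= X <= 3]
c = cdf(1.8) - cdf(1.2)                      # P[1.2 < X <= 1.8]
d = cdf(3) - pmf(3) - cdf(1.99) + pmf(1.99)  # P[1.99 <= X < 3]
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))
```

Up to floating-point rounding, the printed values agree with the results of parts (a) through (d).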


Note that even for a discrete RV, we have taken the CDF to be a function of a continuous 
variable, x in this example. However, for a discrete RV, it is sometimes simpler (but more 
restrictive) to consider the CDF to be discrete also. Let X be a discrete RV taking on values 
$\{x_i\}$ with probability mass function (PMF) $P_X(x_i)$. Then the discrete CDF would only be
defined on the values $\{x_i\}$ also. Assuming that these values are an increasing set, that is,
$x_k < x_{k+1}$ for all k, the discrete CDF would be


$$F_X(x_k) = \sum_{j=-\infty}^{k} P_X(x_j) \quad \text{for all } k.$$


In this format, we compute the CDF only at points corresponding to the countable outcomes 
of the sample space. 




Looking again at the binomial example b(n, p) above, but using the discrete CDF, we
would say the RV K takes on values in the set $\{0 \le k \le 4\}$ with the discrete CDF

$$F_K(k) = \sum_{j=0}^{k} \binom{4}{j} 0.6^j (1 - 0.6)^{4-j} \quad \text{for } 0 \le k \le 4.$$

While this is more natural for a discrete RV, the reader will note that the discrete CDF 
cannot be used to evaluate probabilities such as P[1.5 < K < 3] since it cannot be evaluated 
at 1.5. For this reason, we generally will consider CDFs as defined for a continuous domain, 
even though the RV in question might be discrete valued. 


2.4 PROBABILITY DENSITY FUNCTION (pdf) 


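If $F_X(x)$ is continuous and differentiable, the pdf is computed from


$$f_X(x) = \frac{dF_X(x)}{dx}.$$ (2.4-1)
Properties. If $f_X(x)$ exists, then
(i) $f_X(x) \ge 0$. (2.4-2)
(ii) $\int_{-\infty}^{\infty} f_X(\xi)\,d\xi = F_X(\infty) - F_X(-\infty) = 1$. (2.4-3)
(iii) $F_X(x) = \int_{-\infty}^{x} f_X(\xi)\,d\xi = P[X \le x]$. (2.4-4)
(iv) $F_X(x_2) - F_X(x_1) = \int_{-\infty}^{x_2} f_X(\xi)\,d\xi - \int_{-\infty}^{x_1} f_X(\xi)\,d\xi = \int_{x_1}^{x_2} f_X(\xi)\,d\xi = P[x_1 < X \le x_2]$. (2.4-5)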


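Interpretation of $f_X(x)$.
$$P[x < X \le x + \Delta x] = F_X(x + \Delta x) - F_X(x).$$


If $F_X(x)$ is continuous in its first derivative then, for sufficiently small $\Delta x$,

$$F_X(x + \Delta x) - F_X(x) = \int_x^{x + \Delta x} f_X(\xi)\,d\xi \simeq f_X(x)\,\Delta x.$$


Hence for small $\Delta x$
$$P[x < X \le x + \Delta x] \simeq f_X(x)\,\Delta x.$$ (2.4-6)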


Observe that if $f_X(x)$ exists, meaning that it is bounded and has at most a finite number of
discontinuities, then $F_X(x)$ is continuous and therefore, from Equation 2.3-5, $P[X = x] = 0$.
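The univariate Normal (Gaussian†) pdf. The pdf is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < +\infty.$$ (2.4-7)


There are two distinct parameters: the mean $\mu$ and the standard deviation $\sigma$ (> 0). (Note
that $\sigma^2$ is called the variance.) We show that this density is valid by integrating over all x
as follows:

$$\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2}\,dy, \quad \text{with the substitution } y \triangleq \frac{x - \mu}{\sigma}, \\
&= \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-\frac{1}{2}y^2}\,dy = \frac{2}{\sqrt{2\pi}} \sqrt{\frac{\pi}{2}} = 1,
\end{aligned}$$


where we make use of the known integral


$$\int_0^{\infty} e^{-\frac{1}{2}x^2}\,dx = \sqrt{\frac{\pi}{2}}.$$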

Now the Gaussian (Normal) random variable is very common in applications and a special 
notation is used to specify it. We often say that X is distributed as $N(\mu, \sigma^2)$ or write
$X: N(\mu, \sigma^2)$ to specify this distribution.‡

For any random variable with a well-defined pdf, we can in general compute the mean 
and variance (the square of the standard deviation), if they exist, from the two formulas


$$\mu \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx$$ (2.4-8)
and
$$\sigma^2 \triangleq \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx.$$ (2.4-9)


We will defer to Chapter 4 the proof that the parameters we call $\mu$ and $\sigma^2$ in the Gaussian
distribution are actually the true mean and variance as defined generally in these two
equations. 

For discrete random variables, we compute the mean and variance from the sums 


$$\mu \triangleq \sum_{i=-\infty}^{\infty} x_i P_X(x_i)$$ (2.4-10)


†After the German mathematician/physicist Carl F. Gauss (1777-1855).
‡The reader may note the capital letter on the word Normal. We use this choice to make the reader
aware that while Gaussian or Normal is very common, it is not normal or ubiquitous in the everyday sense.
and

$$\sigma^2 \triangleq \sum_{i=-\infty}^{\infty} (x_i - \mu)^2 P_X(x_i).$$ (2.4-11)


Here are some simple examples of the computation of mean and variance. 


Example 2.4-1 
Let $f_X(x) = 1$, for $0 < x \le 1$ and zero elsewhere. This pdf is a special case of the uniform
law discussed below. The mean is computed as


$$\mu = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^1 x\,dx = 0.5$$


and the variance is computed as
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx = \int_0^1 (x - 0.5)^2\,dx = 1/12.$$


Example 2.4-2 
Suppose we are given that $P_X(0) = P_X(2) = 0.25$, $P_X(1) = 0.5$, and zero elsewhere. For
this discrete RV, we use Equations 2.4-10 and 2.4-11 to obtain


$$\mu = 0 \times 0.25 + 1 \times 0.5 + 2 \times 0.25 = 1$$


and


$$\sigma^2 = (0 - 1)^2 \times 0.25 + (1 - 1)^2 \times 0.5 + (2 - 1)^2 \times 0.25 = 0.5.$$
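
Both examples are easy to reproduce in code. The Python sketch below evaluates the sums of Equations 2.4-10 and 2.4-11 for the discrete case and approximates the integrals of Equations 2.4-8 and 2.4-9 for the uniform pdf on (0, 1] with a midpoint rule; the helper names and the number of integration steps are illustrative.

```python
def mean_var(pmf):
    """Mean and variance of a discrete RV given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, var


def integrate(f, lo, hi, steps=1000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# Example 2.4-2: discrete RV with P_X(0) = P_X(2) = 0.25 and P_X(1) = 0.5.
mu_d, var_d = mean_var({0: 0.25, 1: 0.5, 2: 0.25})

# Example 2.4-1: f_X(x) = 1 on (0, 1], zero elsewhere.
mu_u = integrate(lambda x: x * 1.0, 0.0, 1.0)
var_u = integrate(lambda x: (x - mu_u) ** 2 * 1.0, 0.0, 1.0)
```

The discrete results match the hand computation, and the numerical integrals approach 0.5 and 1/12 as the number of steps grows.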


The mean and variance are common examples of statistical moments, whose discussion 
is postponed till Chapter 4. The Normal pdf is shown in Figure 2.4-1. 

The Normal pdf is widely encountered in all branches of science and engineering as 
well as in social and demographic studies. For example, the IQ of children, the heights of 
men (or women), and the noise voltage produced by a thermally agitated resistor are all 
postulated to be approximately Normal over a large range of values. 


Figure 2.4-1 The Normal pdf.
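Conversion of the Gaussian pdf to the standard Normal. Suppose we are given $X: N(\mu, \sigma^2)$
and must evaluate $P[a < X \le b]$. We have

$$P[a < X \le b] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_a^b e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx.$$

With $\beta \triangleq (x - \mu)/\sigma$, $d\beta = (1/\sigma)\,dx$, $b' \triangleq (b - \mu)/\sigma$, $a' \triangleq (a - \mu)/\sigma$, we obtain

$$P[a < X \le b] = \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} e^{-\frac{1}{2}\beta^2}\,d\beta = \frac{1}{\sqrt{2\pi}} \int_0^{b'} e^{-\frac{1}{2}\beta^2}\,d\beta - \frac{1}{\sqrt{2\pi}} \int_0^{a'} e^{-\frac{1}{2}\beta^2}\,d\beta.$$

The function
$$\operatorname{erf}(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_0^x e^{-\frac{1}{2}t^2}\,dt$$ (2.4-12)
is sometimes called the error function [erf(x)] although other definitions of erf(x) exist.†


The erf(x) is tabulated in Table 2.4-1 and is plotted in Figure 2.4-2.
Hence if $X: N(\mu, \sigma^2)$, then


$$P[a < X \le b] = \operatorname{erf}\left(\frac{b - \mu}{\sigma}\right) - \operatorname{erf}\left(\frac{a - \mu}{\sigma}\right).$$ (2.4-13)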


Example 2.4-3 
(resistor tolerance) Suppose we choose a resistor with resistance R from a batch of resistors
with parameters $\mu = 1000$ ohms and $\sigma = 200$ ohms. What is the probability that R will
have a value between 900 and 1100 ohms?


Solution Assuming that $R: N[1000, (200)^2]$ we compute from Equation 2.4-13
$$P[900 < R \le 1100] = \operatorname{erf}(0.5) - \operatorname{erf}(-0.5).$$
But erf(−x) = −erf(x) (deduced from Equation 2.4-12). Hence, using erf(0.5) = 0.19146 from Table 2.4-1,


$$P[900 < R \le 1100] = 2\operatorname{erf}(0.5) = 0.38.$$


†For example, a widely used definition of erf(x) is $\operatorname{erf}_2(x) \triangleq (2/\sqrt{\pi}) \int_0^x e^{-t^2}\,dt$, which is used in
MATLAB. The relation between these two erf's is $\operatorname{erf}(x) = \frac{1}{2}\operatorname{erf}_2(x/\sqrt{2})$.
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Table 2.4-1 Selected Values of erf(x) 


erf(x) = al exp (-3") dt 


x erf(z) x erf(x) 
0.05 0.01994 2.05 0.47981 
0.10 0.03983 2.10 0.48213 
0.15 0.05962 2.15 0.48421 
0.20 0.07926 2.20 0.48609 
0.25 0.09871 2.25 0.48777 
0.30 0.11791 2.30 0.48927 
0.35 0.13683 2.35 0.49060 
0.40 0.15542 2.40 0.49179 
0.45 0.17364 2.45 0.49285 
0.50 0.19146 2.50 0.49378 
0.55 0.20884 2.55 0.49460 
0.60 0.22575 2.60 0.49533 
0.65 0.24215 2.65 0.49596 
0.70 0.25803 2.70 0.49652 
0.75 0.27337 2.75 0.49701 
0.80 0.28814 2.80 0.49743 
0.85 0.30233 2.85 0.49780 
0.90 0.31594 2.90 0.49812 
0.95 0.32894 2.95 0.49840 
1.00 0.34134 3.00 0.49864 
1.05 0.35314 3.05 0.49884 
1.10 0.36433 3.10 0.49902 
1.15 0.37492 3.15 0.49917 
1.20 0.38492 3.20 0.49930 
1.25 0.39434 3.25 0.49941 
1.30 0.40319 3.30 0.49951 
1.35 0.41149 3.35 0.49958 
1.40 0.41924 3.40 0.49965 
1.45 0.42646 3.45 0.49971 
1.50 0.43319 3.50 0.49976 
1.55 0.43942 3.55 0.49980 
1.60 0.44519 3.60 0.49983 
1.65 0.45052 3.65 0.49986 
1.70 0.45543 3.70 0.49988 
1.75 0.45993 3.75 0.49990 
1.80 0.46406 3.80 0.49992 
1.85 0.46783 3.85 0.49993 
1.90 0.47127 3.90 0.49994 
1.95 0.47440 3.95 0.49995 
2.00 0.47724 4.00 0.49996

Figure 2.4-2 erf(x) versus x.

Using Figure 2.4-3 as an aid in our reasoning, we readily deduce the following for $X$: $N(0,1)$. Assume $x > 0$; then
$$P[X \le x] = \tfrac{1}{2} + \operatorname{erf}(x), \tag{2.4-14a}$$
$$P[X > -x] = \tfrac{1}{2} + \operatorname{erf}(x), \tag{2.4-14b}$$
$$P[X > x] = \tfrac{1}{2} - \operatorname{erf}(x), \tag{2.4-14c}$$
$$P[-x < X \le x] = 2\operatorname{erf}(x), \tag{2.4-14d}$$
$$P[|X| > x] = 1 - 2\operatorname{erf}(x). \tag{2.4-14e}$$
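Because most software libraries implement the other definition of the error function (the one with $e^{-t^2}$ in the integrand), a small conversion is needed before the formulas above can be evaluated numerically. The following Python sketch does this with the standard-library `math.erf`; the names `erf_book` and `normal_interval_prob` are illustrative and do not appear in the text.

```python
import math

def erf_book(x):
    """erf as defined in Eq. 2.4-12: (1/sqrt(2*pi)) * integral_0^x exp(-t^2/2) dt.

    math.erf uses the (2/sqrt(pi)) * integral_0^x exp(-t^2) dt convention,
    so the two are related by erf_book(x) = 0.5 * math.erf(x / sqrt(2)).
    """
    return 0.5 * math.erf(x / math.sqrt(2.0))

def normal_interval_prob(a, b, mu, sigma):
    """P[a < X <= b] for X: N(mu, sigma^2), following Eq. 2.4-13."""
    return erf_book((b - mu) / sigma) - erf_book((a - mu) / sigma)

# Spot-check two table entries and the resistor-tolerance example.
print(round(erf_book(0.5), 5), round(erf_book(1.0), 5))
print(round(normal_interval_prob(900.0, 1100.0, 1000.0, 200.0), 2))
```

The relations 2.4-14a through 2.4-14e can be checked the same way, for instance by comparing `0.5 + erf_book(x)` against a direct evaluation of the standard Normal CDF.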


Example 2.4-4 
(manufacturing) A metal rod is nominally 1 meter long, but due to manufacturing imperfections, the actual length $L$ is a Gaussian random variable with mean $\mu = 1$ and standard deviation $\sigma = 0.005$. What is the probability that the rod length $L$ lies in the interval (0.99, 1.01]? Since the random variable $L$: $N(1, (0.005)^2)$, we have
$$P[0.99 < L \le 1.01] = \int_{0.99}^{1.01} \frac{1}{\sqrt{2\pi}\,(0.005)} \exp\left[-\frac{1}{2}\left(\frac{x-1.00}{0.005}\right)^2\right]dx$$
$$= \int_{\frac{0.99-1.00}{0.005}}^{\frac{1.01-1.00}{0.005}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\,dx$$
$$= \int_{-2}^{2} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\,dx$$
$$= 2\operatorname{erf}(2) = 2 \times 0.4772 \approx 0.954 \quad \text{(from Table 2.4-1)}.$$

Figure 2.4-3 The areas of the shaded region under curves are (a) $P[X \le x]$; (b) $P[X > -x]$; (c) $P[X > x]$; (d) $P[-x < X \le x]$; and (e) $P[|X| > x]$.


Four Other Common Density Functions 


1. Rayleigh ($\sigma > 0$):
$$f_X(x) = \frac{x}{\sigma^2}\, e^{-x^2/2\sigma^2}\, u(x), \tag{2.4-15}$$
where the continuous unit-step function is defined as
$$u(x) \triangleq \begin{cases} 1, & 0 \le x < \infty, \\ 0, & -\infty < x < 0. \end{cases}$$
Thus, $f_X(x) = 0$ for $x < 0$. Examples of where the Rayleigh pdf shows up are in rocket-landing errors, random fluctuations in the envelope of certain waveforms, and radial distribution of misses around the bull's-eye at a rifle range.

2. Exponential ($\mu > 0$):
$$f_X(x) = \frac{1}{\mu}\, e^{-x/\mu}\, u(x). \tag{2.4-16}$$


The exponential law occurs, for example, in waiting-time problems, in calculating lifetime 
of machinery, and in describing the intensity variations of incoherent light. 
3. Uniform ($b > a$):
$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & a < x < b, \\ 0, & \text{otherwise}. \end{cases} \tag{2.4-17}$$
The uniform pdf is used in communication theory, in queueing models, and in situations where we have no a priori knowledge favoring the distribution of outcomes except for the end points; that is, we don't know when a business call will come but it must come, say, between 9 A.M. and 5 P.M. We sometimes use the notation $U(a,b)$ to denote a uniform distribution lower-bounded by $a$ and upper-bounded by $b$.

The three pdf’s are shown in Figure 2.4-4. 

4. Laplacian: The pdf is defined by
$$f_X(x) = \frac{c}{2}\, e^{-c|x|}, \quad -\infty < x < \infty, \quad c > 0. \tag{2.4-18}$$
The Laplacian is widely used in speech and image processing to model the adjacent-sample difference, that is, the difference in signal level between a sample point and its neighbor. Since the levels of the sample point and its neighbor are often the same, the Laplacian peaks at zero. The Laplacian pdf is sometimes written as
$$f_X(x) = \frac{1}{\sqrt{2}\,\sigma} \exp\left[-\sqrt{2}\,|x|/\sigma\right], \quad -\infty < x < \infty, \quad \sigma > 0, \tag{2.4-19}$$
where $\sigma$ is the standard deviation of the Laplacian RV $X$. Precisely what this means will be explained in Chapter 4. The Laplacian pdf is shown in Figure 2.4-5. In image compression, the Laplacian model is appropriate for the so-called "AC coefficients" that arise after a decorrelating transform called the DCT,† which is applied on 8 × 8 blocks of pixels.

Figure 2.4-4 The Rayleigh, exponential, and uniform pdf's.

Figure 2.4-5 The Laplacian pdf used in computer analysis of speech and images.


Example 2.4-5 
(radiated power) The received power $W$ on a cell phone at a certain distance from the base station is found to follow a Rayleigh distribution with parameter $\sigma = 1$ milliwatt. What is the probability that the power $W$ is less than 0.8 milliwatts? Since the power can be modeled by the Rayleigh random variable, we have
$$P[W \le 0.8] = \int_0^{0.8} x\, e^{-x^2/2}\,dx, \quad \text{since } \sigma^2 = 1,$$
$$= \int_{0.0}^{0.32} e^{-y}\,dy, \quad \text{with the substitution } y \triangleq \tfrac{1}{2}x^2,$$
$$= 1 - e^{-0.32} \approx 0.27.$$

†DCT stands for discrete cosine transform and is a variation on the DFT used in signal analysis. A 2-D version is used for images, consisting of a 1-D DCT on the rows followed by a 1-D transform on the columns.


Example 2.4-6 
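(image compression) In designing the quantizer for a JPEG image compression system, we need to know what the range should be for the transformed AC coefficients. Using the Laplacian model with parameter $\sigma$ for such a coefficient $X$, what is the probability of the event $\{|X| \ge k\sigma\}$ as a function of $k = 1, 2, 3, \ldots$? If we then make this probability sufficiently low, by choice of $k$, we will design the quantizer for the range $[-k\sigma, +k\sigma]$ and only saturate the quantizer occasionally. We need to calculate
$$P[|X| \ge k\sigma] = \int_{-\infty}^{-k\sigma} \frac{1}{\sqrt{2}\,\sigma} \exp\left(+\sqrt{2}\,x/\sigma\right)dx + \int_{k\sigma}^{\infty} \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\sqrt{2}\,x/\sigma\right)dx$$
$$= 2\int_{k\sigma}^{\infty} \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\sqrt{2}\,x/\sigma\right)dx$$
$$= 2\int_{k}^{\infty} \frac{1}{\sqrt{2}} \exp\left(-\sqrt{2}\,y\right)dy, \quad \text{with } y \triangleq x/\sigma,$$
$$= 2\left(-\frac{1}{2}\exp\left(-\sqrt{2}\,y\right)\right)\Bigg|_{k}^{\infty}$$
$$= \exp\left(-\sqrt{2}\,k\right).$$
For $k = 2$, we get probability 0.059 and for $k = 5$ we get $0.85 \times 10^{-3}$, or about one in a thousand coefficients.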
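The closed form $\exp(-\sqrt{2}\,k)$ makes it easy to tabulate the saturation probability and to pick the quantizer range for a target overload rate. The Python sketch below is one way to do so; the function names and the target value are illustrative assumptions, not part of the example.

```python
import math

def laplacian_tail(k):
    """P[|X| >= k*sigma] for the Laplacian pdf of Eq. 2.4-19: exp(-sqrt(2)*k)."""
    return math.exp(-math.sqrt(2.0) * k)

def smallest_range_factor(target):
    """Smallest integer k >= 1 whose saturation probability is at most `target`.

    `target` is a hypothetical design parameter with 0 < target < 1.
    """
    k = 1
    while laplacian_tail(k) > target:
        k += 1
    return k

for k in (1, 2, 3, 4, 5):
    print(k, laplacian_tail(k))
print("k for a 1e-3 overload rate:", smallest_range_factor(1e-3))
```

Since the tail decays exponentially in $k$, each additional unit of range reduces the overload probability by the constant factor $e^{-\sqrt{2}} \approx 0.24$.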


Table 2.4-2 lists some common continuous random variables with their probability densi- 
ties and distribution functions. 


More Advanced Density Functions 


5. Chi-square ($n$ an integer):
$$f_X(x) = K_\chi\, x^{(n/2)-1}\, e^{-x/2}\, u(x), \tag{2.4-20}$$
where the normalizing constant $K_\chi$ is computed as $K_\chi = \dfrac{1}{2^{n/2}\,\Gamma(n/2)}$ and $\Gamma(\cdot)$ is the Gamma function discussed in Appendix B. The Chi-square pdf is shown in Figure 2.4-6.

6. Gamma ($b > 0$, $c > 0$):
$$f_X(x) = K_\gamma\, x^{b-1}\, e^{-cx}\, u(x), \tag{2.4-21}$$
where $K_\gamma = c^b/\Gamma(b)$.

7. Student-t ($n$ an integer):
$$f_X(x) = K_{st} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}, \quad -\infty < x < \infty, \tag{2.4-22}$$
where
$$K_{st} = \frac{\Gamma[(n+1)/2]}{\Gamma(n/2)\sqrt{\pi n}}.$$

Table 2.4-2 Common Continuous Probability Densities and Distribution Functions

Family | pdf $f_X(x)$ | CDF $F_X(x)$
Uniform $U(a,b)$ | $\frac{1}{b-a}[u(x-a) - u(x-b)]$ | $0$ for $x < a$; $\frac{x-a}{b-a}$ for $a \le x < b$; $1$ for $b \le x$
Exponential $\mu > 0$ | $\frac{1}{\mu} e^{-x/\mu}\, u(x)$ | $0$ for $x < 0$; $1 - e^{-x/\mu}$ for $x \ge 0$
Gaussian $N(\mu,\sigma^2)$ | $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$ | $\frac{1}{2} + \operatorname{erf}\left(\frac{x-\mu}{\sigma}\right)$
Laplacian $\sigma > 0$ | $\frac{1}{\sqrt{2}\,\sigma} \exp\left[-\sqrt{2}\,\lvert x\rvert/\sigma\right]$ | $\frac{1}{2}\left[1 + \operatorname{sgn}(x)\left(1 - \exp(-\sqrt{2}\,\lvert x\rvert/\sigma)\right)\right]$
Rayleigh $\sigma > 0$ | $\frac{x}{\sigma^2}\, e^{-x^2/2\sigma^2}\, u(x)$ | $\left[1 - e^{-x^2/2\sigma^2}\right] u(x)$

Figure 2.4-6 The Chi-square probability density function for n = 2 (solid), n = 4 (dashed), and n = 10 (stars). Note that for larger values of n, the shape approaches that of a Normal pdf with a positive mean-parameter $\mu$.

Figure 2.4-7 The beta pdf shown for $\beta = 1$, $\alpha = n - 2$, and various values of n. When $\beta = 0$, $\alpha = 0$ the beta pdf becomes uniformly distributed over $0 < x < 1$.


The Chi-square and Student-t densities are widely used in statistics.’ We shall encounter 
these densities later in the book. The gamma density is mother to other densities. For 
example with b = 1, there results the exponential density; and with b = n/2 and c = 1/2, 
there results the Chi-square density. 

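8. Beta ($\alpha > 0$, $\beta > 0$):
$$f_X(x; \alpha, \beta) = \begin{cases} \dfrac{(\alpha+\beta+1)!}{\alpha!\,\beta!}\, x^{\alpha}(1-x)^{\beta}, & 0 < x < 1, \\ 0, & \text{else}. \end{cases}$$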


The beta distribution is a two-parameter family of functions that appears in statistics. It is 
shown in Figure 2.4-7. 

There are other pdf's of importance in engineering and science, and we shall encounter some of them as we continue our study of probability. They all, however, share the properties that
$$f_X(x) \ge 0 \tag{2.4-23}$$
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1. \tag{2.4-24}$$

†The Student-t distribution is so named because its discoverer W. S. Gosset (1876-1937) published his papers under the name "Student." Gosset, E. S. Pearson, R. A. Fisher, and J. Neyman are regarded as the founders of modern statistics.


When $F_X(x)$ is not continuous, strictly speaking, its finite derivative does not exist and, therefore, the pdf doesn't exist. The question of what probability function is useful in describing $X$ depends on the classification of $X$. We consider this next.


2.5 CONTINUOUS, DISCRETE, AND MIXED RANDOM VARIABLES 


If $F_X(x)$ is continuous for every $x$ and its derivative exists everywhere except at a countable set of points, then we say that $X$ is a continuous RV. At points $x$ where $F_X'(x)$ exists, the pdf is $f_X(x) = F_X'(x)$. At points where $F_X(x)$ is continuous, but $F_X'(x)$ is discontinuous, we can assign any positive number to $f_X(x)$; $f_X(x)$ will then be defined for every $x$, and we are free to use the following important formulas:
$$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\,d\xi, \tag{2.5-1}$$
$$P[x_1 < X \le x_2] = \int_{x_1}^{x_2} f_X(\xi)\,d\xi, \tag{2.5-2}$$
and
$$P[B] = \int_{\xi:\,\xi \in B} f_X(\xi)\,d\xi, \tag{2.5-3}$$
where, in Equation 2.5-3, $B \in \mathscr{F}$, that is, $B$ is an event. Equation 2.5-3 follows from the fact that for a continuous random variable, events can be written as a union of disjoint intervals in $R$. Thus, for example, let $B = \{\xi: \xi \in \bigcup_{i=1}^{n} I_i,\ I_i \cap I_j = \phi \text{ for } i \ne j\}$, where $I_i = (a_i, b_i]$. Then clearly,
$$P[B] = \int_{a_1}^{b_1} f_X(\xi)\,d\xi + \int_{a_2}^{b_2} f_X(\xi)\,d\xi + \ldots + \int_{a_n}^{b_n} f_X(\xi)\,d\xi = \int_{\xi:\,\xi \in B} f_X(\xi)\,d\xi. \tag{2.5-4}$$


A discrete random variable has a staircase type of distribution function (Figure 2.5-1). A probability measure for a discrete RV is the probability mass function† (PMF). The PMF $P_X(x)$ of a (discrete) random variable $X$ is defined as
$$P_X(x) \triangleq P[X = x] = P[X \le x] - P[X < x]. \tag{2.5-5}$$
Thus, $P_X(x) = 0$ everywhere where $F_X(x)$ is continuous and has nonzero values only where there is a discontinuity, that is, a jump, in the CDF. If we denote $P[X < x]$ by $F_X(x^-)$, then at the jumps $x_i$, $i = 1, 2, \ldots$, the finite values of $P_X(x_i)$ can be computed from $P_X(x_i) = F_X(x_i) - F_X(x_i^-)$.

†Like mass, probability is nonnegative and conserved. Hence the term mass in probability mass function.

Figure 2.5-1 The cumulative distribution function for a discrete random variable.

The probability mass function is used when there are at most a countable set of outcomes of the random experiment. Indeed $P_X(x_i)$ lends itself to the following frequency interpretation: Perform an experiment $n$ times and let $n_i$ be the number of tries that $x_i$ appears as an outcome. Then, for $n$ large,
$$P_X(x_i) \approx \frac{n_i}{n}. \tag{2.5-6}$$


Because the PMF is so closely related to the frequency notion of probability, it is sometimes 
called the frequency function. 

Since for a discrete RV $F_X(x)$ is not continuous, $f_X(x)$, strictly speaking, does not exist. Nevertheless, with the introduction of Dirac delta functions,† we shall be able to assign pdf's to discrete RVs as well. The CDF for a discrete RV is given by
$$F_X(x) \triangleq P[X \le x] = \sum_{\text{all } x_i \le x} P_X(x_i) \tag{2.5-7}$$
and, more generally, for any event $B$ when $X$ is discrete:
$$P[B] = \sum_{\text{all } x_i \in B} P_X(x_i). \tag{2.5-8}$$


†Also called impulses or impulse functions. Named after the English physicist Paul A. M. Dirac (1902-1984). Delta functions are discussed in Section B.2 of Appendix B.

Some Common Discrete Random Variables

1. Bernoulli random variable $B$ with parameter $p$ ($0 < p < 1$, $q \triangleq 1 - p$):
$$P_B(k) = \begin{cases} q, & k = 0, \\ p, & k = 1, \\ 0, & \text{else}, \end{cases} \tag{2.5-9}$$
$$= q\,\delta(k) + p\,\delta(k-1), \quad \text{by use of the discrete delta function† } \delta(k). \tag{2.5-10}$$
The Bernoulli random variable appears in those situations where the outcome is one of two possible states, for example, whether a particular bit in a digital sequence is "one" or "zero." The Bernoulli PMF can be conveniently written as $P_B(k) = p^k q^{1-k}$ for $k = 0$ or 1 and then zero elsewhere. The corresponding CDF is given as
$$F_B(k) = \begin{cases} 0, & k < 0, \\ q, & 0 \le k < 1, \\ 1, & k \ge 1, \end{cases}$$
$$= q\,u(k) + p\,u(k-1) \quad \text{by use of the unit-step function } u(k).$$


2. Binomial random variable $K$ with parameters $n$ and $p$ ($n = 1, 2, \ldots$; $0 < p < 1$) and $k$ an integer:
$$P_K(k) = \begin{cases} \dbinom{n}{k} p^k q^{n-k}, & 0 \le k \le n, \\ 0, & \text{else}, \end{cases} \tag{2.5-11}$$
$$= \binom{n}{k} p^k q^{n-k}\, [u(k) - u(k-n-1)]. \tag{2.5-12}$$
The binomial random variable appears in games of chance, military defense strategies, failure analysis, and many other situations. Its corresponding CDF is given as ($i$, $k$, $n$ are integers)
$$F_K(k) = \begin{cases} 0, & k < 0, \\ \sum_{i=0}^{k} \dbinom{n}{i} p^i q^{n-i}, & 0 \le k < n, \\ 1, & k \ge n. \end{cases}$$

3. Poisson random variable $X$ with parameter $\mu$ ($> 0$) and $k$ an integer:
$$P_X(k) = \begin{cases} \dfrac{\mu^k}{k!}\, e^{-\mu}, & 0 \le k < \infty, \\ 0, & \text{else}. \end{cases} \tag{2.5-13}$$
The Poisson law is widely used in every branch of science and engineering (see Section 1.10). We can write the Poisson PMF in a single line by use of the unit-step function $u(k)$ as
$$P_X(k) = \frac{\mu^k}{k!}\, e^{-\mu}\, u(k),$$
where the discrete unit-step function is defined by
$$u(k) \triangleq \begin{cases} 1, & 0 \le k < \infty, \\ 0, & -\infty < k < 0. \end{cases}$$

4. Geometric random variable $K$ with parameters $p > 0$, $q > 0$ ($p + q = 1$) and $k$ an integer:
$$P_K(k) = p\,q^k\, u(k).$$
The corresponding CDF is given by a finite sum of the geometric series (ref. Appendix A) as
$$F_K(k) = p\left(\frac{1 - q^{k+1}}{1 - q}\right) u(k) = \left(1 - q^{k+1}\right) u(k).$$
This distribution† was first seen in Example 1.9-4. As there, also note the variant $p\,q^{n-1}$, $n \ge 1$, also called geometric RV.

†Recall that the discrete delta function has value 1 when the argument is 0 and has value 0 for every other value.


Example 2.5-1 
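(CDF of Poisson RV) Calculating the CDF of a Poisson random variable proceeds as follows. Let $X$ be a Poisson random variable with parameter $\mu$ ($> 0$). Then by definition the PMF is $P_X(k) = \frac{\mu^k}{k!} e^{-\mu} u(k)$. Then the CDF $F_X(k) = 0$ for $k < 0$. For $k \ge 0$, we have
$$F_X(k) = \sum_{i=0}^{k} \frac{\mu^i}{i!}\, e^{-\mu} = e^{-\mu} \sum_{i=0}^{k} \frac{\mu^i}{i!} = \frac{\Gamma(k+1, \mu)}{k!},$$
where $\Gamma(k+1, \mu)$ denotes the incomplete gamma function (see Appendix B).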
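The finite sum that defines the Poisson CDF is straightforward to evaluate directly. The following Python sketch does so; `poisson_pmf` and `poisson_cdf` are illustrative names, and the parameter value used in the printout is an arbitrary choice.

```python
import math

def poisson_pmf(k, mu):
    """P_X(k) = mu^k * exp(-mu) / k! for integer k >= 0, and 0 otherwise."""
    if k < 0:
        return 0.0
    return mu ** k * math.exp(-mu) / math.factorial(k)

def poisson_cdf(k, mu):
    """F_X(k) = sum of P_X(i) for i = 0..k; the empty sum gives 0 for k < 0."""
    return sum(poisson_pmf(i, mu) for i in range(0, k + 1))

mu = 2.0  # illustrative parameter value
for k in range(-1, 6):
    print(k, poisson_cdf(k, mu))
```

As expected of a CDF, the values are nondecreasing in $k$ and approach 1 as $k$ grows.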


Table 2.5-1 lists the common discrete RVs, their PMFs, and their CDFs. 

Sometimes an RV is neither purely discrete nor purely continuous. We call such an RV a mixed RV. The CDF of a mixed RV is shown in Figure 2.5-2. Thus, $F_X(x)$ is discontinuous but not a staircase-type function.


†Note that we sometimes speak of the probability distribution in a general sense without meaning the distribution function per se. Here we give a PMF to illustrate the geometric distribution.

Table 2.5-1 Common Discrete RVs, PMFs, and CDFs

Family | PMF $P_X(k)$ | CDF $F_X(k)$
Bernoulli $p, q$ | $q\,\delta(k) + p\,\delta(k-1)$ | $q\,u(k) + p\,u(k-1)$
Binomial $n, p$ | $\binom{n}{k} p^k q^{n-k}\,[u(k) - u(k-n-1)]$ | $0$ for $k < 0$; $\sum_{i=0}^{k} \binom{n}{i} p^i q^{n-i}$ for $0 \le k < n$; $1$ for $k \ge n$
Poisson $\mu > 0$ | $\frac{\mu^k}{k!}\, e^{-\mu}\, u(k)$ | $\frac{\Gamma(k+1,\mu)}{k!} \times u(k)$†
Geometric $p, q$ | $p\,q^k\, u(k)$ | $p\left(\frac{1-q^{k+1}}{1-q}\right) u(k)$

Figure 2.5-2 The CDF of a mixed RV.


The distinction between continuous and discrete RVs is somewhat artificial. Continuous 
and discrete RVs are often regarded as different objects even though the only real difference 
between them is that for the former the CDF is continuous while for the latter it is not. By 
introducing delta functions we can, to a large extent, treat them in the same fashion and 
compute probabilities for both continuous and discrete RVs by integrating pdf’s. 

Returning now to Equation 2.5-7, which can be written as
$$F_X(x) = \sum_{i=-\infty}^{\infty} P_X(x_i)\, u(x - x_i), \tag{2.5-14}$$
and using the results from the section on delta functions in Appendix B enables us to write for a discrete RV
$$f_X(x) = \frac{dF_X(x)}{dx} = \sum_{i=-\infty}^{\infty} P_X(x_i)\, \delta(x - x_i). \tag{2.5-15}$$


†See Appendix B for a definition of the incomplete gamma.

Figure 2.5-3 (a) CDF of a discrete RV X; (b) pdf of X using delta functions.


Example 2.5-2 
(practice example) Let $X$ be a discrete RV with distribution function as shown in Figure 2.5-3(a). The pdf of $X$ is
$$f_X(x) = \frac{dF_X(x)}{dx} = 0.2\,\delta(x) + 0.6\,\delta(x-1) + 0.2\,\delta(x-3)$$
and is shown in Figure 2.5-3(b). To compute probabilities from the pdf for a discrete RV, great care must be used in choosing the interval of integration. Thus,
$$F_X(x) = \int_{-\infty}^{x^+} f_X(\xi)\,d\xi,$$
which includes the delta function at $x$ if there is one there.
Similarly $P[x_1 < X \le x_2]$ involves the interval $(x_1, x_2]$ and includes the impulse at $x_2$ (if there is one there) but excludes what happens at $x_1$. On the other hand $P[x_1 \le X < x_2]$ involves the interval $[x_1, x_2)$ and therefore
$$P[x_1 \le X < x_2] = \int_{x_1^-}^{x_2^-} f_X(\xi)\,d\xi.$$
Applied to the foregoing example, these formulas give
$$P[X \le 1.5] = F_X(1.5) = 0.8 \tag{2.5-16}$$
$$P[1 < X \le 3] = 0.2 \tag{2.5-17}$$
$$P[1 \le X < 3] = 0.6. \tag{2.5-18}$$
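The bookkeeping about which endpoint impulses are counted can be made explicit in code. The Python sketch below represents the pdf of this example by its impulse locations and weights and sums the weights that fall in a given interval; the function name `interval_prob` and its boolean flags are illustrative assumptions rather than notation from the text.

```python
# Impulse locations and weights of f_X(x) = 0.2*delta(x) + 0.6*delta(x-1) + 0.2*delta(x-3).
impulses = {0.0: 0.2, 1.0: 0.6, 3.0: 0.2}

def interval_prob(impulses, lo, hi, include_lo, include_hi):
    """Sum the impulse weights located between lo and hi.

    include_lo / include_hi state whether an impulse sitting exactly
    on the corresponding endpoint is counted.
    """
    total = 0.0
    for x, weight in impulses.items():
        above = x > lo or (include_lo and x == lo)
        below = x < hi or (include_hi and x == hi)
        if above and below:
            total += weight
    return total

print(interval_prob(impulses, float("-inf"), 1.5, False, True))  # P[X <= 1.5]
print(interval_prob(impulses, 1.0, 3.0, False, True))            # P[1 < X <= 3]
print(interval_prob(impulses, 1.0, 3.0, True, False))            # P[1 <= X < 3]
```

Switching the two flags is the code analogue of moving the integration limits from $x^-$ to $x^+$.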


Example 2.5-3 
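(Practice example) The pdf associated with the Poisson law with parameter $a$ is, from Equation 2.5-15,
$$f_X(x) = e^{-a} \sum_{k=0}^{\infty} \frac{a^k}{k!}\, \delta(x - k).$$

Example 2.5-4
(Practice example) The pdf associated with the binomial law $b(k; n, p)$ is
$$f_X(x) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\, \delta(x - k).$$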

Example 2.5-5 
(Practice example) The pdf of a mixed RV is shown in Figure 2.5-4. (1) What is the constant $K$? (2) Compute $P[X \le 5]$, $P[5 \le X < 10]$. (3) Draw the distribution function.

Solution (1) Since
$$\int_{-\infty}^{\infty} f_X(\xi)\,d\xi = 1,$$
we obtain $10K + 0.25 + 0.25 = 1 \Rightarrow K = 0.05$.
(2) Since $P[X \le 5] = P[X < 5] + P[X = 5]$, the impulse at $x = 5$ must be included. Hence
$$P[X \le 5] = \int_0^{5^+} [0.05 + 0.25\,\delta(\xi - 5)]\,d\xi = 0.5.$$
To compute $P[5 \le X < 10]$, we leave out the impulse at $x = 10$ but include the impulse at $x = 5$. Thus,
$$P[5 \le X < 10] = \int_{5^-}^{10^-} [0.05 + 0.25\,\delta(\xi - 5)]\,d\xi = 0.5.$$

Figure 2.5-4 (a) pdf of a mixed RV for Example 2.5-5; (b) computed CDF.
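The calculation in Example 2.5-5 can be mirrored numerically by adding the area under the continuous part of the pdf to the weights of the impulses inside the interval. The Python sketch below assumes, as the normalization $10K + 0.25 + 0.25 = 1$ suggests, that the continuous part is flat at height $K$ on $[0, 10]$; the function name and its flags are illustrative.

```python
# Assumed model: flat density K on [0, 10] plus impulses of weight 0.25 at x = 5 and x = 10.
K = (1.0 - 0.25 - 0.25) / 10.0
impulses = {5.0: 0.25, 10.0: 0.25}

def mixed_prob(lo, hi, include_lo, include_hi):
    """Probability of the interval from lo to hi for the mixed RV.

    include_lo / include_hi state whether an impulse located exactly
    on that endpoint is counted.
    """
    a, b = max(lo, 0.0), min(hi, 10.0)
    total = K * max(b - a, 0.0)          # area under the continuous part
    for x, weight in impulses.items():   # point masses inside the interval
        above = x > lo or (include_lo and x == lo)
        below = x < hi or (include_hi and x == hi)
        if above and below:
            total += weight
    return total

print(K)
print(mixed_prob(float("-inf"), 5.0, False, True))   # P[X <= 5]
print(mixed_prob(5.0, 10.0, True, False))            # P[5 <= X < 10]
```

Evaluating `mixed_prob(float("-inf"), x, False, True)` on a grid of `x` values traces out the distribution function asked for in part (3), including its jumps at 5 and 10.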


2.6 CONDITIONAL AND JOINT DISTRIBUTIONS AND DENSITIES 


Consider the event $C$ consisting of all outcomes $\zeta \in \Omega$ such that $X(\zeta) \le x$ and $\zeta \in B \subset \Omega$, where $B$ is another event. Then, by definition, the event $C$ is the set intersection of the two events $\{\zeta: X(\zeta) \le x\}$ and $\{\zeta: \zeta \in B\}$. We define the conditional distribution function of $X$ given the event $B$ as
$$F_X(x|B) \triangleq \frac{P[C]}{P[B]} = \frac{P[X \le x, B]}{P[B]}, \tag{2.6-1}$$
where $P[X \le x, B]$ is the probability of the joint event $\{X \le x\} \cap B$ and $P[B] \ne 0$. If $x = \infty$, the event $\{X \le \infty\}$ is the certain event $\Omega$ and since $\Omega \cap B = B$, $F_X(\infty|B) = 1$. Similarly, if $x = -\infty$, $\{X \le -\infty\} = \phi$ and since $\phi \cap B = \phi$, $F_X(-\infty|B) = 0$. Continuing in this fashion, it is not difficult to show that $F_X(x|B)$ has all the properties of an ordinary distribution, that is, $x_1 \le x_2 \rightarrow F_X(x_1|B) \le F_X(x_2|B)$.

For example, consider the event $\{X \le x_2, B\}$ and write (assuming $x_2 \ge x_1$)
$$\{X \le x_2, B\} = \{X \le x_1, B\} \cup \{x_1 < X \le x_2, B\}.$$
Since the two events on the right are disjoint, their probabilities add and we obtain
$$P[X \le x_2, B] = P[X \le x_1, B] + P[x_1 < X \le x_2, B]$$
or
$$P[X \le x_2|B]P[B] = P[X \le x_1|B]P[B] + P[x_1 < X \le x_2|B]P[B].$$
Thus when $P[B] \ne 0$, we obtain after rearranging terms and dividing by $P[B]$
$$P[x_1 < X \le x_2|B] = P[X \le x_2|B] - P[X \le x_1|B] = F_X(x_2|B) - F_X(x_1|B). \tag{2.6-2}$$
Generally the event $B$ will be expressed on the probability space $(R, \mathscr{B}, P_X)$ rather than the original space $(\Omega, \mathscr{F}, P)$. The conditional pdf is simply
$$f_X(x|B) \triangleq \frac{dF_X(x|B)}{dx}. \tag{2.6-3}$$


Following are some examples. 


Example 2.6-1 
(evaluating conditional CDFs) Let $B \triangleq \{X \le 10\}$. We wish to compute $F_X(x|B)$.

(i) For $x \ge 10$, the event $\{X \le 10\}$ is a subset of the event $\{X \le x\}$. Hence $P[X \le 10, X \le x] = P[X \le 10]$ and use of Equation 2.6-1 gives
$$F_X(x|B) = \frac{P[X \le x, X \le 10]}{P[X \le 10]} = 1.$$
(ii) For $x \le 10$, the event $\{X \le x\}$ is a subset of the event $\{X \le 10\}$. Hence $P[X \le 10, X \le x] = P[X \le x]$ and
$$F_X(x|B) = \frac{P[X \le x]}{P[X \le 10]}.$$
The result is shown in Figure 2.6-1. We leave as an exercise to the reader to compute $F_X(x|B)$ when $B = \{b < X \le a\}$.

Figure 2.6-1 Conditional and unconditional CDFs of X.


Example 2.6-2 
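(Poisson conditioned on even) Let $X$ be a Poisson RV with parameter $\mu$ ($> 0$). We wish to compute the conditional PMF and CDF of $X$ given the event $\{X = 0, 2, 4, \ldots\} \triangleq \{X \text{ (is) even}\}$. First observe that $P[X \text{ even}]$ is given by
$$P[X = 0, 2, \ldots] = \sum_{k=0,2,\ldots} \frac{\mu^k}{k!}\, e^{-\mu}.$$
Then for $X$ odd, we have
$$P[X = 1, 3, \ldots] = \sum_{k=1,3,\ldots} \frac{\mu^k}{k!}\, e^{-\mu}.$$
From these relations, we obtain
$$\sum_{k \ge 0 \text{ and even}} \frac{\mu^k}{k!}\, e^{-\mu} - \sum_{k \ge 0 \text{ and odd}} \frac{\mu^k}{k!}\, e^{-\mu} = \sum_{k=0}^{\infty} \frac{\mu^k}{k!} (-1)^k e^{-\mu}$$
$$= \sum_{k=0}^{\infty} \frac{(-\mu)^k}{k!}\, e^{-\mu}$$
$$= e^{-\mu} e^{-\mu}$$
$$= e^{-2\mu}$$
and
$$\sum_{k \ge 0 \text{ and even}} \frac{\mu^k}{k!}\, e^{-\mu} + \sum_{k \ge 0 \text{ and odd}} \frac{\mu^k}{k!}\, e^{-\mu} = 1.$$
Hence $P[X \text{ even}] = P[X = 0, 2, \ldots] = \frac{1}{2}(1 + e^{-2\mu})$. Using the definition of conditional PMF, we obtain
$$P_X(k|X \text{ even}) = \frac{P[X = k, X \text{ even}]}{P[X \text{ even}]}.$$
If $k$ is even, then $\{X = k\}$ is a subset of $\{X \text{ even}\}$. If $k$ is odd, $\{X = k\} \cap \{X \text{ even}\} = \phi$. Hence $P[X = k, X \text{ even}] = P[X = k]$ for $k$ even and it equals 0 for $k$ odd. So we have
$$P_X(k|X \text{ even}) = \begin{cases} \dfrac{2}{1 + e^{-2\mu}} \dfrac{\mu^k}{k!}\, e^{-\mu}, & k \ge 0 \text{ and even}, \\ 0, & k \text{ odd}. \end{cases}$$
The conditional CDF is then
$$F_X(x|X \text{ even}) = \sum_{\text{all } k \le x} P_X(k|X \text{ even}) = \sum_{0 \le k \le x \text{ and even}} \frac{2}{1 + e^{-2\mu}} \frac{\mu^k}{k!}\, e^{-\mu}.$$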
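The closed form $P[X \text{ even}] = \frac{1}{2}(1 + e^{-2\mu})$ can be confirmed by summing the Poisson PMF over even integers, and the conditional PMF can be checked to sum to one. The Python sketch below does both; the function names, the truncation length `terms`, and the value of `mu` are illustrative assumptions.

```python
import math

def prob_even(mu, terms=200):
    """Sum the Poisson PMF over even k, truncating the series after `terms` values."""
    total = 0.0
    term = math.exp(-mu)          # PMF value at k = 0
    for k in range(terms):
        if k % 2 == 0:
            total += term
        term *= mu / (k + 1)      # PMF value at k + 1
    return total

def cond_pmf_even(k, mu):
    """P_X(k | X even): the Poisson PMF rescaled by 1 / P[X even] for even k >= 0."""
    if k < 0 or k % 2 == 1:
        return 0.0
    p_even = 0.5 * (1.0 + math.exp(-2.0 * mu))
    return mu ** k * math.exp(-mu) / math.factorial(k) / p_even

mu = 3.0  # illustrative parameter value
print(prob_even(mu), 0.5 * (1.0 + math.exp(-2.0 * mu)))
print(sum(cond_pmf_even(k, mu) for k in range(100)))
```

Note that $P[X \text{ even}]$ always exceeds 1/2 and tends to 1/2 as $\mu$ grows.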


Let us next derive some important formulas involving conditional CDFs and pdf’s. 


The distribution function written as a weighted sum of conditional distribution functions. Equation 1.6-7 in Chapter 1 gave the probability of the event $B$ in terms of $n$ mutually exclusive and exhaustive events $\{A_i\}$, $i = 1, \ldots, n$, defined on the same probability space as $B$. With $B = \{X \le x\}$, we immediately obtain from Equation 1.6-7:
$$F_X(x) = \sum_{i=1}^{n} F_X(x|A_i) P[A_i]. \tag{2.6-4}$$


Equation 2.6-4 describes F'x(x) as a weighted sum of conditional distribution functions. 
One way to view Equation 2.6-4 is an “average” over all the conditional CDFs.' Since we 
haven’t yet made concrete the notion of average (this will be done in Chapter 4), we ask 
only that the reader accept the nomenclature since it is in use in the technical literature. 


Example 2.6-3 
(defective memory chips) In the automated manufacturing of computer memory chips, company Z produces one defective chip for every five good chips. The defective chips (DC) have a time of failure $X$ that obeys the CDF
$$F_X(x|DC) = (1 - e^{-x/2})\,u(x) \quad (x \text{ in months})$$
while the time of failure for the good chips (GC) obeys the CDF
$$F_X(x|GC) = (1 - e^{-x/10})\,u(x) \quad (x \text{ in months}).$$
The chips are visually indistinguishable. A chip is purchased. What is the probability that the chip will fail before six months of use?

Solution The unconditional CDF for the chip is, from Equation 2.6-4,
$$F_X(x) = F_X(x|DC)P[DC] + F_X(x|GC)P[GC],$$
where $P[DC]$ and $P[GC]$ are the probabilities of selecting a defective and good chip, respectively. From the given data $P[DC] = 1/6$ and $P[GC] = 5/6$. Thus,
$$F_X(6) = [1 - e^{-3}]\,\frac{1}{6} + [1 - e^{-0.6}]\,\frac{5}{6} = 0.158 + 0.376 = 0.534.$$

†For this reason, when $F_X(x)$ is written as in Equation 2.6-4, it is sometimes called the average distribution function.

Bayes' formula for probability density functions. Consider the events $B$ and $\{X = x\}$ defined on the same probability space. Then from the definition of conditional probability, it seems reasonable to write
$$P[B|X = x] = \frac{P[B, X = x]}{P[X = x]}. \tag{2.6-5}$$
The problem with Equation 2.6-5 is that if $X$ is a continuous RV, then $P[X = x] = 0$. Hence Equation 2.6-5 is undefined. Nevertheless, we can compute $P[B|X = x]$ by taking appropriate limits of probabilities involving the event $\{x < X \le x + \Delta x\}$. Thus, consider the expression
$$P[B|x < X \le x + \Delta x] = \frac{P[x < X \le x + \Delta x|B]\,P[B]}{P[x < X \le x + \Delta x]}.$$
If we (i) divide numerator and denominator of the expression on the right by $\Delta x$, (ii) use the fact that $P[x < X \le x + \Delta x|B] = F_X(x + \Delta x|B) - F_X(x|B)$, and (iii) take the limit as $\Delta x \to 0$, we obtain
$$P[B|X = x] = \lim_{\Delta x \to 0} P[B|x < X \le x + \Delta x] = \frac{f_X(x|B)\,P[B]}{f_X(x)}, \quad f_X(x) \ne 0. \tag{2.6-6}$$
The quantity on the left is sometimes called the a posteriori probability (or a posteriori density) of $B$ given $X = x$. Multiplying both sides of Equation 2.6-6 by $f_X(x)$ and integrating enables us to obtain the important result
$$P[B] = \int_{-\infty}^{\infty} P[B|X = x]\, f_X(x)\,dx. \tag{2.6-7}$$
In line with the terminology used in this section, $P[B]$ is sometimes called the average probability of $B$, the usage being suggested by the form of Equation 2.6-7.

Figure 2.6-2 Based upon observing the signal, the receiver R must decide which switch was closed or, equivalently, which of the sources A, B, C was responsible for the signal. Only one switch can be closed at the time the receiver is on.


Example 2.6-4 
(detecting closed switch) A signal, X, can come from one of three different sources designated 
as A, B, or C. The signal from A is distributed as N(—1, 4); the signal from B is distributed 
as N(0,1); and the signal from C has an N(1,4) distribution. In order for the signal to reach 
its destination at R, the switch in the line must be closed. Only one switch can be closed 
when the signal X is observed at R, but it is not known which switch it is. However, it is 
known that switch a is closed twice as often as switch b, which is closed twice as often as 
switch c (Figure 2.6-2). 


(a) Compute $P[X \le -1]$;
(b) Given that we observe the event $\{X > -1\}$, from which source was this signal most likely?

Solution (a) Let $P[A]$ denote the probability that A is responsible for the observation at R, that is, switch a is closed. Likewise for $P[B]$, $P[C]$. Then from the information about the switches we get $P[A] = 2P[B] = 4P[C]$ and $P[A] + P[B] + P[C] = 1$. Hence $P[A] = 4/7$, $P[B] = 2/7$, $P[C] = 1/7$. Next we compute $P[X \le -1]$ from
$$P[X \le -1] = P[X \le -1|A]P[A] + P[X \le -1|B]P[B] + P[X \le -1|C]P[C],$$
where
$$P[X \le -1|A] = 1/2 \tag{2.6-8}$$
$$P[X \le -1|B] = 1/2 - \operatorname{erf}(1) = 0.159 \tag{2.6-9}$$
$$P[X \le -1|C] = 1/2 - \operatorname{erf}(1) = 0.159. \tag{2.6-10}$$
Hence $P[X \le -1] = 1/2 \times 4/7 + 0.159 \times 2/7 + 0.159 \times 1/7 \approx 0.354$.

(b) We wish to compute $\max\{P[A|X > -1], P[B|X > -1], P[C|X > -1]\}$. To enable this computation, we note that $P[X > -1|A] = 1 - P[X \le -1|A]$, and so on, for B and C. Concentrating on source A, and using Bayes' rule, we get
$$P[A|X > -1] = \frac{(1 - P[X \le -1|A])\,P[A]}{1 - P[X \le -1]},$$
which, using the values already computed, yields $P[A|X > -1] = 0.44$. Repeating the calculation for the other sources, we obtain
$$P[B|X > -1] = 0.372, \tag{2.6-11}$$
$$P[C|X > -1] = 0.186. \tag{2.6-12}$$
Hence, since the maximum a posteriori probability favors A, source A was the most likely cause of the event $\{X > -1\}$.
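The a posteriori probabilities in this example follow from the priors and the three Normal CDFs, so they can be reproduced in a few lines. The Python sketch below uses the standard-library `math.erf` to build the Normal CDF; the dictionary layout and function names are illustrative, and the second parameter of each $N(\cdot,\cdot)$ is read as a variance, so sources A and C have standard deviation 2.

```python
import math

def normal_cdf(x, mu, sigma):
    """P[X <= x] for X: N(mu, sigma^2), written with the library erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# name: (prior probability, mean, standard deviation)
sources = {"A": (4.0 / 7.0, -1.0, 2.0),
           "B": (2.0 / 7.0, 0.0, 1.0),
           "C": (1.0 / 7.0, 1.0, 2.0)}

def prob_at_most(threshold):
    """P[X <= threshold] by total probability over the three sources."""
    return sum(prior * normal_cdf(threshold, mu, sigma)
               for prior, mu, sigma in sources.values())

def posteriors_given_exceeds(threshold):
    """P[source | X > threshold] for each source, by Bayes' rule."""
    joint = {name: prior * (1.0 - normal_cdf(threshold, mu, sigma))
             for name, (prior, mu, sigma) in sources.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}

post = posteriors_given_exceeds(-1.0)
print(prob_at_most(-1.0))
print(post, "most likely:", max(post, key=post.get))
```

Changing the threshold shows how the decision depends on the observation; for sufficiently large thresholds the wider-spread sources become more likely than B despite B's larger prior relative to C.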


Poisson transform. An important specific example of Equation 2.6-7 is the so-called Poisson transform in which $B$ is the event that a random variable $Y$ takes on an integer value $k$ from the set $\{0, 1, \ldots\}$, that is, $B = \{Y = k\}$, and $X$ is the Poisson parameter, treated here as a random variable with pdf $f_X(x)$. The ordinary Poisson law
$$P[Y = k] = e^{-\mu}\, \frac{\mu^k}{k!}, \quad k \ge 0, \tag{2.6-13}$$
where $\mu$ is the average number of events in a given interval (time, distance, volume, and so forth), treats the parameter as a constant. But in many situations the underlying phenomenon that determines $\mu$ is itself random and $\mu$ must be viewed as a random outcome, that is, the outcome of a random experiment. Thus, there are two elements of randomness: the random value of $\mu$ and the random outcome $\{Y = k\}$. When $\mu$ is random it seems appropriate to replace it by the notation of a random variable, say $X$. Thus, for any given outcome $\{X = x\}$ the probability $P[Y = k|X = x]$ is Poisson; but the unconditional probability of the event $\{Y = k\}$ is not necessarily Poisson. Because both the number of events and the Poisson parameter are random, this situation is sometimes called doubly stochastic. From Equation 2.6-7 we obtain for the unconditional PMF of $Y$
$$P_Y(k) = \int_0^{\infty} \frac{x^k}{k!}\, e^{-x} f_X(x)\,dx, \quad k \ge 0. \tag{2.6-14}$$
The above equation is known as the Poisson transform and can be used to obtain $f_X(x)$ if $P_Y(k)$ is obtained by experimentation. The mechanism by which $f_X(x)$ is obtained from $P_Y(k)$ is the inverse Poisson transform. The derivation of the latter is as follows. Let
$$F(\omega) \triangleq \frac{1}{2\pi} \int_0^{\infty} e^{j\omega x} e^{-x} f_X(x)\,dx, \tag{2.6-15}$$
that is, the inverse Fourier transform of $e^{-x} f_X(x)$. Since
$$e^{j\omega x} = \sum_{k=0}^{\infty} [j\omega x]^k / k!, \tag{2.6-16}$$
we obtain
$$F(\omega) = \frac{1}{2\pi} \sum_{k=0}^{\infty} (j\omega)^k \int_0^{\infty} \frac{x^k}{k!}\, e^{-x} f_X(x)\,dx = \frac{1}{2\pi} \sum_{k=0}^{\infty} (j\omega)^k P_Y(k). \tag{2.6-17}$$
Thus, $F(\omega)$ is known if $P_Y(k)$ is known. Taking the forward Fourier transform of $F(\omega)$ yields
$$e^{-x} f_X(x) = \int_{-\infty}^{\infty} F(\omega)\, e^{-j\omega x}\,d\omega,$$
or
$$f_X(x) = e^{x} \int_{-\infty}^{\infty} F(\omega)\, e^{-j\omega x}\,d\omega. \tag{2.6-18}$$
Equation 2.6-18 is the inverse relation we have been seeking. Thus to summarize: If we know $P_Y(k)$, we can compute $F(\omega)$. Knowing $F(\omega)$ enables us to obtain $f_X(x)$ by a Fourier transform. We illustrate the Poisson transform with an application from optical communication theory.


Example 2.6-5 
(optical communications) In an optical communication system, light from the transmitter strikes a photodetector, which generates a photocurrent consisting of valence electrons having become conduction electrons (Figure 2.6-3).

It is known from physics that if the transmitter uses coherent laser light of constant intensity the Poisson parameter $X$ has pdf
$$f_X(x) = \delta(x - x_0), \quad x_0 > 0, \tag{2.6-19}$$
where $x_0$, except for a constant, is the laser intensity. On the other hand, if the transmitter uses thermal illumination, then the Poisson parameter $X$ obeys the exponential law:
$$f_X(x) = \frac{1}{\mu}\, e^{-x/\mu}\, u(x), \tag{2.6-20}$$
where $\mu > 0$ is now just a parameter, but one that will later be shown to be the true mean value of $X$. Compute the PMF for the electron-count variable $Y$.

Solution For coherent laser illumination we obtain from Equation 2.6-14
$$P_Y(k) = \int_0^{\infty} \frac{x^k}{k!}\, e^{-x}\, \delta(x - x_0)\,dx \tag{2.6-21}$$
$$= \frac{x_0^k}{k!}\, e^{-x_0}, \quad k \ge 0. \tag{2.6-22}$$
Thus, for coherent laser illumination, the photoelectrons obey the Poisson law. For thermal illumination, we obtain
$$P_Y(k) = \frac{1}{\mu\, k!} \int_0^{\infty} x^k e^{-\alpha x}\,dx, \quad \text{with } \alpha \triangleq 1 + \frac{1}{\mu},$$
$$= \frac{1}{\mu\, k!\, \alpha^{k+1}} \int_0^{\infty} z^k e^{-z}\,dz, \quad \text{with } z \triangleq \alpha x,$$
$$= \frac{1}{\mu\, k!\, \alpha^{k+1}}\, \Gamma(k+1), \quad \text{where } \Gamma \text{ denotes the Gamma function (see Appendix B), and } \Gamma(k+1) = k!,$$
$$= \frac{\mu^k}{(1+\mu)^{k+1}}, \quad k \ge 0. \tag{2.6-23}$$
This PMF law is known as the geometric distribution and is sometimes called Bose-Einstein statistics [2-4]. It obeys the interesting recurrence relation
$$P_Y(k+1) = \frac{\mu}{1+\mu}\, P_Y(k). \tag{2.6-24}$$
Depending on which illumination applies, the statistics of the photocurrents are widely dissimilar.

Figure 2.6-3 (a) Optical communication system; (b) output current from photodetector.
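The thermal-illumination result can be checked numerically by evaluating the Poisson transform of Equation 2.6-14 with a simple quadrature rule and comparing it with the closed form $\mu^k/(1+\mu)^{k+1}$. The Python sketch below uses a midpoint rule on a truncated interval; the function names, the integration limit, the step count, and the value of `mu` are illustrative assumptions.

```python
import math

def thermal_count_pmf(k, mu):
    """Closed form for thermal illumination: mu^k / (1 + mu)^(k + 1)."""
    return mu ** k / (1.0 + mu) ** (k + 1)

def poisson_transform(k, pdf, upper, steps=100000):
    """Midpoint-rule approximation of Eq. 2.6-14, truncated to [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** k / math.factorial(k) * math.exp(-x) * pdf(x)
    return total * h

mu = 2.0  # illustrative value of the thermal parameter

def exponential_pdf(x):
    """Exponential pdf of Eq. 2.6-20 for x >= 0."""
    return math.exp(-x / mu) / mu

for k in range(5):
    print(k, poisson_transform(k, exponential_pdf, 60.0), thermal_count_pmf(k, mu))
```

The same `poisson_transform` routine accepts any other pdf for the Poisson parameter, which makes it a convenient way to explore doubly stochastic counting models that lack a closed form.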


Joint distributions and densities. As stated in Section 2.1, it is possible to define more than one random variable on a probability space. For example, consider a probability space $(\Omega, \mathscr{F}, P)$ involving an underlying experiment consisting of the simultaneous throwing of two fair coins. Here the ordering is not important and the only elementary outcomes are $\zeta_1 =$ HH, $\zeta_2 =$ HT, $\zeta_3 =$ TT, the sample space is $\Omega = \{$HH, HT, TT$\}$, the $\sigma$-field of events is $\phi$, $\Omega$, {HT}, {TT}, {HH}, {TT or HT}, {HH or HT}, and {HH or TT}. The probabilities are easily computed and are, respectively, 0, 1, 1/2, 1/4, 1/4, 3/4, 3/4, and 1/2. Now define two random variables
$$X_1(\zeta) \triangleq \begin{cases} 0, & \text{if at least one H}, \\ 1, & \text{otherwise}, \end{cases} \tag{2.6-25}$$
$$X_2(\zeta) \triangleq \begin{cases} -1, & \text{if one H and one T}, \\ +1, & \text{otherwise}. \end{cases} \tag{2.6-26}$$
Then $P[X_1 = 0] = 3/4$, $P[X_1 = 1] = 1/4$, $P[X_2 = -1] = 1/2$, $P[X_2 = 1] = 1/2$. Also we can easily compute the probability of joint events, for example, $P[X_1 = 0, X_2 = 1] = P[\{\text{HH}\}] = 1/4$.

In defining more than one random variable on a probability space, it is possible to define degenerate random variables. For example, suppose the underlying experiment consists of observing the number $\zeta$ that is pointed to when a spinning wheel, numbered 0 to 100, comes to rest. Suppose we let $X_1(\zeta) = \zeta$ and $X_2(\zeta) = e^{\zeta}$. This situation is degenerate because observing one random variable completely specifies the other. In effect the uncertainty is associated with only one random variable, not both; we might as well forget about observing the other one. If we define more than one random variable on a probability space, degeneracy can be avoided if the underlying experiment is complex enough, or rich enough in outcomes. In the example we considered at the beginning, observing that $X_1 = 0$ doesn't specify the value of $X_2$, while observing $X_2 = 1$ doesn't specify the value of $X_1$.

The event $\{X \le x, Y \le y\} \triangleq \{X \le x\} \cap \{Y \le y\}$ consists of all outcomes $\zeta \in \Omega$ such
that $X(\zeta) \le x$ and $Y(\zeta) \le y$. The point set induced by the event $\{X \le x, Y \le y\}$ is the
shaded region in the $x'y'$ plane shown in Figure 2.6-4. In the diagram the numbers $x$, $y$
are shown positive. In general they can have any value. The joint cumulative distribution
function of X and Y is defined by

$$F_{XY}(x,y) = P[X \le x, Y \le y]. \tag{2.6-27}$$

By definition $F_{XY}(x,y)$ is a probability; thus it follows that $F_{XY}(x,y) \ge 0$ for all $x$,
$y$. Since $\{X \le \infty, Y \le \infty\}$ is the certain event, $F_{XY}(\infty,\infty) = 1$. The point set associated
with the certain event is the whole $x'y'$ plane. The event $\{X \le -\infty, Y \le -\infty\}$ is the
impossible event and therefore $F_{XY}(-\infty,-\infty) = 0$. The reader should consider the events
$\{X \le x, Y \le -\infty\}$ and $\{X \le -\infty, Y \le y\}$; are they impossible events also?

Figure 2.6-4 Point set associated with the event $\{X \le x, Y \le y\}$.

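Since $\{X \le \infty\}$ and $\{Y \le \infty\}$ are certain events, and for any event $B$, $B \cap \Omega = B$, we
obtain

$$
\begin{aligned}
\{X \le x, Y \le \infty\} &= \{X \le x\} \cap \{Y \le \infty\} \\
&= \{X \le x\} \cap \Omega \\
&= \{X \le x\}
\end{aligned}
\tag{2.6-28}
$$

so that

$$F_{XY}(x,\infty) = F_X(x) \tag{2.6-29a}$$

$$F_{XY}(\infty,y) = F_Y(y). \tag{2.6-29b}$$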


If $F_{XY}(x,y)$ is continuous and differentiable, the joint pdf can be obtained from

$$f_{XY}(x,y) = \frac{\partial^2}{\partial x\,\partial y}\left[F_{XY}(x,y)\right]. \tag{2.6-30}$$

It follows, then, that

$$f_{XY}(x,y)\,dx\,dy = P[x < X \le x+dx,\ y < Y \le y+dy]$$

and hence that $f_{XY}(x,y) \ge 0$ for all $(x,y)$.

By twice integrating Equation 2.6-30, we obtain

$$F_{XY}(x,y) = \int_{-\infty}^{x} d\xi \int_{-\infty}^{y} d\eta\, f_{XY}(\xi,\eta). \tag{2.6-31}$$

Equation 2.6-31 says that $F_{XY}(x,y)$ is the integral of the nonnegative function $f_{XY}(x,y)$
over the surface shown in Figure 2.6-4. It follows that integrating $f_{XY}(x,y)$ over a larger
surface will generally yield a larger probability (never a smaller one!) than integrating
over a smaller surface. From this we can deduce some obvious but important results.
Thus, if $(x_1, y_1)$ and $(x_2, y_2)$ denote two pairs of numbers and if $x_1 \le x_2$, $y_1 \le y_2$,
then $F_{XY}(x_1,y_1) \le F_{XY}(x_2,y_2)$. In general, $F_{XY}(x,y)$ increases as $(x,y)$ moves up and
to the right and decreases as $(x,y)$ moves down and to the left. Also $F_{XY}$ is continuous
from above and from the right, that is, at a point of discontinuity, say $(x_0, y_0)$, with $\varepsilon$,
$\delta > 0$:

$$F_{XY}(x_0,y_0) = \lim_{\substack{\varepsilon \to 0 \\ \delta \to 0}} F_{XY}(x_0+\varepsilon,\ y_0+\delta).$$

Thus, at a point of discontinuity, $F_{XY}$ assumes the value immediately to the right and
above the point.


Properties of joint CDF $F_{XY}(x,y)$
(i) $F_{XY}(\infty,\infty) = 1$; $F_{XY}(-\infty,y) = F_{XY}(x,-\infty) = 0$; also $F_{XY}(x,\infty) = F_X(x)$;
$F_{XY}(\infty,y) = F_Y(y)$.
(ii) If $x_1 \le x_2$, $y_1 \le y_2$, then $F_{XY}(x_1,y_1) \le F_{XY}(x_2,y_2)$.
(iii) $F_{XY}(x,y) = \lim_{\varepsilon \to 0,\ \delta \to 0} F_{XY}(x+\varepsilon,\ y+\delta)$, $\varepsilon, \delta > 0$ (continuity from the right and from
above).
(iv) For all $x_2 \ge x_1$ and $y_2 \ge y_1$, we must have

$$F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1) \ge 0.$$

This last and key property (iv) is a two-dimensional generalization of the nondecreasing
property for one-dimensional CDFs, that is, $F_X(x_2) - F_X(x_1) \ge 0$ for all $x_2 \ge x_1$. It arises
out of the need for the event $\{x_1 < X \le x_2,\ y_1 < Y \le y_2\}$ to have nonnegative probability.
The point set induced by this event is shown in Figure 2.6-5.

The key to this computation is to observe that the set $\{X \le x_2, Y \le y_2\}$ lends itself to
the following decomposition into disjoint sets:

$$
\begin{aligned}
\{X \le x_2, Y \le y_2\} = {} & \{x_1 < X \le x_2,\ y_1 < Y \le y_2\} \\
& \cup \{x_1 < X \le x_2,\ Y \le y_1\} \cup \{X \le x_1,\ y_1 < Y \le y_2\} \\
& \cup \{X \le x_1,\ Y \le y_1\}.
\end{aligned}
\tag{2.6-32}
$$

Now using the induced result from Axiom 3 (Equation 1.5-3), we obtain

$$
\begin{aligned}
F_{XY}(x_2,y_2) = {} & P[x_1 < X \le x_2,\ y_1 < Y \le y_2] \\
& + P[x_1 < X \le x_2,\ Y \le y_1] + P[X \le x_1,\ y_1 < Y \le y_2] \\
& + F_{XY}(x_1,y_1).
\end{aligned}
\tag{2.6-33}
$$

Figure 2.6-5 Point set for the event $\{x_1 < X \le x_2,\ y_1 < Y \le y_2\}$.


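According to the elementary properties of the definite integral, the second and third terms
on the right-hand side of Equation 2.6-33 can be written, respectively, as

$$
\int_{x_1}^{x_2}\!\int_{-\infty}^{y_1} f_{XY}(\xi,\eta)\,d\eta\,d\xi
= \int_{-\infty}^{x_2}\!\int_{-\infty}^{y_1} f_{XY}(\xi,\eta)\,d\eta\,d\xi
- \int_{-\infty}^{x_1}\!\int_{-\infty}^{y_1} f_{XY}(\xi,\eta)\,d\eta\,d\xi
\tag{2.6-34}
$$

$$
\int_{-\infty}^{x_1}\!\int_{y_1}^{y_2} f_{XY}(\xi,\eta)\,d\eta\,d\xi
= \int_{-\infty}^{x_1}\!\int_{-\infty}^{y_2} f_{XY}(\xi,\eta)\,d\eta\,d\xi
- \int_{-\infty}^{x_1}\!\int_{-\infty}^{y_1} f_{XY}(\xi,\eta)\,d\eta\,d\xi.
\tag{2.6-35}
$$

But the terms on the right-hand sides of these equations are all distributions; thus, Equations
2.6-34 and 2.6-35 become

$$\int_{x_1}^{x_2}\!\int_{-\infty}^{y_1} f_{XY}(\xi,\eta)\,d\eta\,d\xi = F_{XY}(x_2,y_1) - F_{XY}(x_1,y_1), \tag{2.6-36}$$

$$\int_{-\infty}^{x_1}\!\int_{y_1}^{y_2} f_{XY}(\xi,\eta)\,d\eta\,d\xi = F_{XY}(x_1,y_2) - F_{XY}(x_1,y_1). \tag{2.6-37}$$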


Now going back to Equation 2.6-33 and using Equations 2.6-36 and 2.6-37 we find that

$$
\begin{aligned}
F_{XY}(x_2,y_2) = {} & P[x_1 < X \le x_2,\ y_1 < Y \le y_2] \\
& + F_{XY}(x_2,y_1) - F_{XY}(x_1,y_1) + F_{XY}(x_1,y_2) - F_{XY}(x_1,y_1) \\
& + F_{XY}(x_1,y_1).
\end{aligned}
\tag{2.6-38}
$$

After simplifying and rearranging terms so that the desired quantity appears on the left-hand
side, we finally get

$$
\begin{aligned}
P[x_1 < X \le x_2,\ y_1 < Y \le y_2] = {} & F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) \\
& - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1).
\end{aligned}
\tag{2.6-39}
$$
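
The four-corner formula of Equation 2.6-39 is easy to exercise numerically. The sketch below
assumes, purely for illustration, a joint CDF built from two independent unit-rate exponential
random variables (this choice and all function names are ours), and compares the four-corner
result with the probability computed directly from the product form.

```python
import math

def joint_cdf(x, y):
    """Illustrative joint CDF: independent unit-rate exponentials (an assumption)."""
    fx = 1.0 - math.exp(-x) if x > 0 else 0.0
    fy = 1.0 - math.exp(-y) if y > 0 else 0.0
    return fx * fy

def rect_prob(cdf, x1, x2, y1, y2):
    """P[x1 < X <= x2, y1 < Y <= y2] from the four-corner formula (Equation 2.6-39)."""
    return cdf(x2, y2) - cdf(x2, y1) - cdf(x1, y2) + cdf(x1, y1)

p = rect_prob(joint_cdf, 0.5, 1.5, 0.2, 2.0)
# For this particular CDF the same probability factors into two interval probabilities.
direct = (math.exp(-0.5) - math.exp(-1.5)) * (math.exp(-0.2) - math.exp(-2.0))
print(p, direct)
```

The formula itself does not rely on independence; the factored check is available here only
because of the CDF we assumed.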
Equation 2.6-39 is generally true for any random variables X, Y, independent or not.
Some caution must be taken in applying Equation 2.6-39. For example, Figures 2.6-6(a)
and (b) show two regions A, B involving excursions of the random variables X, Y such
that $\{x_1 < X \le x_2\}$ and $\{y_1 < Y \le y_2\}$. However, the use of Equation 2.6-39 would not be
appropriate here since neither region is a rectangle with sides parallel to the axes. In the
case of the event shown in Figure 2.6-6(a), a rotational coordinate transformation might
save the day but this would involve some knowledge of transformation of random variables,
a subject covered in the next chapter. The probabilities of the events whose point sets are
shown in Figure 2.6-6 can still be computed by integration of the probability density function
(pdf) provided that the integration is done over the appropriate region. We illustrate with
the following example.


Example 2.6-6
(probabilities for nonrectangular sets) We are given $f_{XY}(x,y) = e^{-(x+y)}u(x)u(y)$ and wish
to compute $P[(X,Y) \in \mathcal{A}]$, where $\mathcal{A}$ is the shaded region shown in Figure 2.6-7. The region
$\mathcal{A}$ is described by $\mathcal{A} = \{(x,y): 0 \le x \le 1,\ |y| \le x\}$. We obtain

$$
\begin{aligned}
P[(X,Y)\in\mathcal{A}] &= \int_{0}^{1}\!\int_{-x}^{x} e^{-(x+y)}u(x)u(y)\,dy\,dx \\
&= \int_0^1\left(\int_{-x}^{x} e^{-y}u(y)\,dy\right)e^{-x}u(x)\,dx \\
&= \int_0^1\left(\int_0^x e^{-y}\,dy\right)e^{-x}\,dx \\
&= \int_0^1\left(-e^{-y}\Big|_0^x\right)e^{-x}\,dx \\
&= \int_0^1 (1-e^{-x})\,e^{-x}\,dx \\
&= 1 - e^{-1} - \tfrac{1}{2} + \tfrac{1}{2}e^{-2} \\
&= \tfrac{1}{2} - e^{-1} + \tfrac{1}{2}e^{-2} \\
&\approx 0.1998.
\end{aligned}
\tag{2.6-40}
$$

Figure 2.6-6 Point sets of events A and B whose probabilities are not given by Equation 2.6-39: (a) point set for A; (b) point set for B.

Figure 2.6-7 The region $\mathcal{A}$ for Example 2.6-6.
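
The value in Equation 2.6-40 can be cross-checked by integrating the pdf over the wedge
$0 \le x \le 1$, $|y| \le x$ directly. The following sketch uses a plain midpoint rule; the grid size
and function names are illustrative choices, not part of the example.

```python
import math

def pdf(x, y):
    """Joint pdf exp(-(x + y)) for x, y >= 0 and zero elsewhere."""
    return math.exp(-(x + y)) if x >= 0.0 and y >= 0.0 else 0.0

def prob_region(n=400):
    """Midpoint-rule estimate of P[(X, Y) in A], A = {0 <= x <= 1, |y| <= x}.
    n is kept even so that y = 0 falls on a cell boundary."""
    total = 0.0
    hx = 1.0 / n
    for i in range(n):
        x = (i + 0.5) * hx
        hy = 2.0 * x / n          # each strip runs from y = -x to y = +x
        for j in range(n):
            y = -x + (j + 0.5) * hy
            total += pdf(x, y) * hx * hy
    return total

closed_form = 0.5 - math.exp(-1.0) + 0.5 * math.exp(-2.0)
print(prob_region(), closed_form)
```

Only the half of the wedge with $y \ge 0$ contributes, because the unit steps make the pdf vanish
for negative $y$; the code leaves that to the `pdf` function rather than trimming the region.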


Example 2.6-7
(computing CDF) Let X, Y be two random variables with joint pdf $f_{XY}(x,y) = 1$ for
$0 < x \le 1$, $0 < y \le 1$, and zero elsewhere. The support for the pdf is shown in gray; the
support for the event $(-\infty, x] \times (-\infty, y]$ for values $0 < x \le 1$, $0 < y \le 1$ is shown bounded
by the heavy black line.

(Figure panel: the case $0 < x \le 1$, $0 < y \le 1$.)

For the situation shown in the figure, $F_{XY}(x,y) = \int_0^x\!\int_0^y 1\,dx'\,dy' = xy$.
When $0 < x \le 1$, $y > 1$, we obtain $F_{XY}(x,y) = \int_0^x dx' \int_0^1 dy' = x$. Proceeding in this way,
we eventually obtain a complete characterization of the CDF as

(Figure panel: the case $0 < x \le 1$, $y > 1$.)

$$
F_{XY}(x,y) = \begin{cases}
0, & x \le 0 \text{ or } y \le 0, \\
xy, & 0 < x \le 1,\ 0 < y \le 1, \\
x, & 0 < x \le 1,\ y > 1, \\
y, & x > 1,\ 0 < y \le 1, \\
1, & x > 1,\ y > 1.
\end{cases}
$$

As Examples 2.6-6 and 2.6-7 illustrate for specific cases, the probability of any event of
the form $\{(X,Y) \in \mathcal{A}\}$ can be computed by the formula

$$P[(X,Y) \in \mathcal{A}] = \iint_{\mathcal{A}} f_{XY}(x,y)\,dx\,dy \tag{2.6-41}$$

provided $f_{XY}(x,y)$ exists. While Equation 2.6-41 seems entirely reasonable, its veracity
requires demonstration. One way to do this is to decompose the arbitrarily shaped region
into a (possibly very large) number of tiny disjoint rectangular regions $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N$.
Then the event $\{(X,Y) \in \mathcal{A}\}$ is decomposed as

$$\{(X,Y) \in \mathcal{A}\} = \bigcup_{i=1}^{N} \{(X,Y) \in \mathcal{A}_i\}$$

with the consequence that (by induced Axiom 3)

$$P[(X,Y) \in \mathcal{A}] = \sum_{i=1}^{N} P[(X,Y) \in \mathcal{A}_i]. \tag{2.6-42}$$

But the probabilities on the right-hand side can be expressed in terms of distributions and
hence in terms of integrals of densities (Equation 2.6-39). Then, taking the limit as $N$
becomes large and the $\mathcal{A}_i$ become infinitesimal, we would obtain Equation 2.6-41.

The functions $F_X(x)$ and $F_Y(y)$ are called marginal distributions if they are derived
from a joint distribution. Thus,

$$F_X(x) = F_{XY}(x,\infty) = \int_{-\infty}^{x}\!\int_{-\infty}^{\infty} f_{XY}(\xi,y)\,dy\,d\xi \tag{2.6-43}$$

$$F_Y(y) = F_{XY}(\infty,y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{y} f_{XY}(x,\eta)\,d\eta\,dx. \tag{2.6-44}$$

Since the marginal densities are given by

$$f_X(x) = \frac{dF_X(x)}{dx} \tag{2.6-45}$$

$$f_Y(y) = \frac{dF_Y(y)}{dy}, \tag{2.6-46}$$

we obtain the following by partial differentiation of Equation 2.6-43 with respect to x and
of Equation 2.6-44 with respect to y:

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dy \tag{2.6-47}$$

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dx. \tag{2.6-48}$$

We next summarize the key properties of the joint pdf $f_{XY}(x,y)$.


Properties of Joint pdf's.

(i) $f_{XY}(x,y) \ge 0$ for all $x$, $y$.
(ii) $\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{XY}(x,y)\,dx\,dy = 1$ (the certain event).
(iii) While $f_{XY}(x,y)$ is not a probability (indeed, it can be greater than 1), we can regard
$f_{XY}(x,y)\,dx\,dy$ as a differential probability. We will sometimes write
$f_{XY}(x,y)\,dx\,dy = P[x < X \le x+dx,\ y < Y \le y+dy]$.
(iv) $f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dy$ and $f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dx$.

Property (i) follows from the fact that the integral of the joint pdf over any region of
the plane must be nonnegative. Also, considering this joint pdf as the mixed partial derivative
of the CDF, property (i) easily follows from a limiting operation applied to property (iv)
of the joint CDF. Property (ii) follows from the fact that the integral of the joint pdf over
the whole plane gives us the probability that the random variables will take on some value,
which is the certain event with probability 1.

For discrete random variables we obtain similar results. Given the joint PMF $P_{XY}(x_i, y_k)$
for all $x_i$, $y_k$, we compute the marginal PMFs from

$$P_X(x_i) = \sum_{\text{all } y_k} P_{XY}(x_i, y_k) \tag{2.6-49}$$

$$P_Y(y_k) = \sum_{\text{all } x_i} P_{XY}(x_i, y_k). \tag{2.6-50}$$


Example 2.6-8
(waiting time at a restaurant) A certain restaurant has been found to have the following
joint distribution for the waiting time for service for a newly arriving customer and the total
number of customers including the new arrival. Let W be a random variable representing
the continuous waiting time for a newly arriving customer, and let N be a discrete random
variable representing the total number of customers.

The joint distribution function is then given as

$$
F_{W,N}(w,n) = \begin{cases}
0, & n < 0 \text{ or } w < 0, \\[2pt]
\left(1 - e^{-w/\mu_0}\right)\dfrac{n}{10}, & 0 \le n \le 5,\ w \ge 0, \\[6pt]
\left(1 - e^{-w/\mu_0}\right)\dfrac{5}{10} + \left(1 - e^{-w/\mu_1}\right)\dfrac{n-5}{10}, & 5 < n \le 10,\ w \ge 0, \\[6pt]
\left(1 - e^{-w/\mu_0}\right)\dfrac{5}{10} + \left(1 - e^{-w/\mu_1}\right)\dfrac{5}{10}, & 10 < n,\ w \ge 0,
\end{cases}
$$

where the parameters $\mu_i$ satisfy $0 < \mu_0 < \mu_1$. Note that this choice of the parameters means
that waiting times are longer when the number of customers is large.

Noting that W is continuous and N is discrete, we sketch this joint distribution as a
function of $w$ for several values of $n$ for $n \ge 0$ and $w \ge 0$ in Figure 2.6-8.

Figure 2.6-8 The CDF of Example 2.6-8: (top) number of customers in the range 1 to 5; (bottom)
number of customers in the range 6 to 10.

We next find the joint mixed probability density-mass function

$$
\begin{aligned}
f_{W,N}(w,n) &\triangleq \frac{\partial}{\partial w}\,\nabla_n F_{W,N}(w,n) \\
&= \frac{\partial}{\partial w}\left\{F_{W,N}(w,n) - F_{W,N}(w,n-1)\right\} \\
&= \frac{\partial}{\partial w}F_{W,N}(w,n) - \frac{\partial}{\partial w}F_{W,N}(w,n-1).
\end{aligned}
$$

In words, $f_{W,N}(w,n)$ is the pdf of W together or jointly with $\{N = n\}$. Calculating, we
obtain

$$
\nabla_n F_{W,N}(w,n) = F_{W,N}(w,n) - F_{W,N}(w,n-1) = u(w)\begin{cases}
\left(1 - e^{-w/\mu_0}\right)\dfrac{1}{10}, & 0 < n \le 5, \\[6pt]
\left(1 - e^{-w/\mu_1}\right)\dfrac{1}{10}, & 5 < n \le 10, \\[6pt]
0, & \text{else.}
\end{cases}
$$

Therefore,

$$
f_{W,N}(w,n) = \frac{\partial}{\partial w}\,\nabla_n F_{W,N}(w,n) = u(w)\begin{cases}
\dfrac{1}{10}\,\dfrac{1}{\mu_0}\,e^{-w/\mu_0}, & 0 < n \le 5, \\[6pt]
\dfrac{1}{10}\,\dfrac{1}{\mu_1}\,e^{-w/\mu_1}, & 5 < n \le 10, \\[6pt]
0, & \text{else.}
\end{cases}
$$

Thus we see a simpler view in terms of the joint pdf, where the shorter average waiting
time $\mu_0$ governs the RV W when there are five or fewer customers, while the
longer average waiting time $\mu_1$ governs when there are more than five customers. In a more
detailed model, the average waiting time would be expected to increase with each increase
in $n$.
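
The differencing-then-differentiating recipe for the mixed density-mass function can be tried
numerically. The sketch below assumes the piecewise CDF of this example carries the weights
n/10 and (n - 5)/10, picks arbitrary values for the two mean waiting times, and compares a
finite-difference evaluation with the exponential closed form. All names and parameter values
are our own illustrative choices.

```python
import math

MU0, MU1 = 1.0, 3.0  # assumed mean waits (arbitrary units), with 0 < MU0 < MU1

def cdf(w, n):
    """Joint CDF F_{W,N}(w, n) for integer n (assumed weights n/10 and (n - 5)/10)."""
    if n < 0 or w < 0:
        return 0.0
    g0 = 1.0 - math.exp(-w / MU0)
    g1 = 1.0 - math.exp(-w / MU1)
    if n <= 5:
        return g0 * n / 10.0
    return g0 * 0.5 + g1 * (min(n, 10) - 5) / 10.0

def mixed_density(w, n, h=1e-5):
    """Central difference in w of the backward difference in n."""
    def step(v):
        return cdf(v, n) - cdf(v, n - 1)
    return (step(w + h) - step(w - h)) / (2.0 * h)

def closed_form(w, n):
    """(1 / (10 mu)) exp(-w / mu), with mu = MU0 for n = 1..5 and MU1 for n = 6..10."""
    if w < 0 or not 1 <= n <= 10:
        return 0.0
    mu = MU0 if n <= 5 else MU1
    return math.exp(-w / mu) / (10.0 * mu)

for n in (3, 8, 12):
    print(n, mixed_density(2.0, n), closed_form(2.0, n))
```

For n beyond 10 the backward difference in n is zero, so the density-mass function vanishes
there, as the "else" branch of the closed form indicates.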


Independent random variables. Two RVs X and Y are said to be independent if the
events $\{X \le x\}$ and $\{Y \le y\}$ are independent for every combination of $x$, $y$. In Section 1.5
two events A and B were said to be independent if $P[AB] = P[A]P[B]$. Taking $AB \triangleq \{X \le
x\} \cap \{Y \le y\}$, where $A \triangleq \{X \le x\}$, $B \triangleq \{Y \le y\}$, and recalling that $F_X(x) \triangleq P[X \le x]$,
and so forth for $F_Y(y)$, it then follows immediately that

$$F_{XY}(x,y) = F_X(x)F_Y(y) \tag{2.6-51}$$

for every $x$, $y$ if and only if X and Y are independent. Also

$$
\begin{aligned}
f_{XY}(x,y) &= \frac{\partial^2 F_{XY}(x,y)}{\partial x\,\partial y} \qquad\qquad (2.6\text{-}52) \\
&= \frac{\partial F_X(x)}{\partial x}\,\frac{\partial F_Y(y)}{\partial y} \\
&= f_X(x)f_Y(y). \qquad\qquad\ \ (2.6\text{-}53)
\end{aligned}
$$

From the definition of conditional probability we obtain for independent X, Y:

$$F_X(x \mid Y \le y) = \frac{F_{XY}(x,y)}{F_Y(y)} = F_X(x), \tag{2.6-54}$$

and so forth, for $F_Y(y \mid X \le x)$. From these results it follows (by differentiation) that for
independent random variables the conditional pdf's are equal to the marginal pdf's, that is,

$$f_X(x \mid Y \le y) = f_X(x) \tag{2.6-55}$$

$$f_Y(y \mid X \le x) = f_Y(y). \tag{2.6-56}$$

It is easy to show from Equation 2.6-39 that the events $\{x_1 < X \le x_2\}$ and
$\{y_1 < Y \le y_2\}$ are independent if X and Y are independent random variables, that is,

$$P[x_1 < X \le x_2,\ y_1 < Y \le y_2] = P[x_1 < X \le x_2]\,P[y_1 < Y \le y_2] \tag{2.6-57}$$

if $F_{XY}(x,y) = F_X(x)F_Y(y)$. Indeed, using Equation 2.6-39,

$$
\begin{aligned}
P[x_1 < X \le x_2,\ y_1 < Y \le y_2]
&= F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1) && (2.6\text{-}58) \\
&= F_X(x_2)F_Y(y_2) - F_X(x_2)F_Y(y_1) - F_X(x_1)F_Y(y_2) + F_X(x_1)F_Y(y_1) && (2.6\text{-}59) \\
&= \left(F_X(x_2) - F_X(x_1)\right)\left(F_Y(y_2) - F_Y(y_1)\right) && (2.6\text{-}60) \\
&= P[x_1 < X \le x_2]\,P[y_1 < Y \le y_2]. && (2.6\text{-}61)
\end{aligned}
$$


Example 2.6-9
The experiment consists of throwing a fair die once. The sample space for the experiment
is $\Omega = \{1,2,3,4,5,6\}$. We define two RVs as follows:

$$X(\zeta) \triangleq \begin{cases} 1+\zeta, & \text{for outcomes } \zeta = 1 \text{ or } 3 \\ 0, & \text{for all other values of } \zeta \end{cases}$$

$$Y(\zeta) \triangleq \begin{cases} 1-\zeta, & \text{for outcomes } \zeta = 1, 2, \text{ or } 3 \\ 0, & \text{for all other values of } \zeta \end{cases}$$

(a) Compute the relevant single and joint PMFs.
(b) Compute the joint CDFs $F_{XY}(1,1)$, $F_{XY}(3,-0.5)$, $F_{XY}(5,-1.5)$.
(c) Are the RVs X and Y independent?


Solution Since the die is assumed fair, each face has a probability of 1/6 of showing up.
(a) So the singleton events $\{\zeta\}$ are all equally likely, with probability $P[\{\zeta\}] = 1/6$. Thus, we
obtain $X(1) = 2$, $X(3) = 4$, and for the other outcomes, we have

$$X(2) = X(4) = X(5) = X(6) = 0.$$

Thus, the PMF $P_X$ is given as $P_X(0) = 4/6$, $P_X(2) = 1/6$, $P_X(4) = 1/6$, and $P_X(k) = 0$
for all other $k$.
Likewise, from the definition of $Y(\zeta)$, we obtain

$$Y(1) = Y(4) = Y(5) = Y(6) = 0,$$
$$Y(2) = -1, \text{ and } Y(3) = -2,$$

thus yielding PMF values $P_Y(0) = 4/6$, $P_Y(-1) = 1/6$, $P_Y(-2) = 1/6$, and $P_Y(k) = 0$ for
all other $k$.

We next compute the joint PMFs $P_{XY}(i,j)$ directly from the definition, that is,
$P_{XY}(i,j) = P[\text{all } \zeta : X(\zeta) = i,\ Y(\zeta) = j]$. This is easily done if we recall that joint
probabilities are probabilities of intersections of subsets of $\Omega$ and, for example, the event of
observing the die faces of 2, 4, 5, or 6 is written as the subset {2,4,5,6}. Thus, $P_{XY}(0,0) =
P[\text{all } \zeta : X(\zeta) = 0,\ Y(\zeta) = 0] = P[\{2,4,5,6\} \cap \{1,4,5,6\}] = P[\{4,5,6\}] = 1/2$.
Likewise we compute:

$$
\begin{aligned}
P_{XY}(2,0) &= P[\{1\} \cap \{1,4,5,6\}] = P[\{1\}] = 1/6 \\
P_{XY}(4,0) &= P[\{3\} \cap \{1,4,5,6\}] = P[\phi] = 0 \\
P_{XY}(0,-1) &= P[\{2,4,5,6\} \cap \{2\}] = P[\{2\}] = 1/6 \\
P_{XY}(2,-1) &= P[\{1\} \cap \{2\}] = P[\phi] = 0 \\
P_{XY}(4,-1) &= P[\{3\} \cap \{2\}] = P[\phi] = 0 \\
P_{XY}(0,-2) &= P[\{2,4,5,6\} \cap \{3\}] = P[\phi] = 0 \\
P_{XY}(2,-2) &= P[\{1\} \cap \{3\}] = P[\phi] = 0 \\
P_{XY}(4,-2) &= P[\{3\} \cap \{3\}] = P[\{3\}] = 1/6.
\end{aligned}
$$

(b) For computing the joint CDF, it is helpful to graph these points and their associated
probabilities. These probabilities are shown in parentheses. From the graph we see that

$$F_{XY}(1,1) = P_{XY}(0,0) + P_{XY}(0,-1) + P_{XY}(0,-2) = 2/3.$$

Likewise $F_{XY}(3,-0.5) = P_{XY}(0,-1) + P_{XY}(2,-1) + P_{XY}(0,-2) + P_{XY}(2,-2) = 1/6$ and
$F_{XY}(5,-1.5) = P_{XY}(0,-2) + P_{XY}(2,-2) + P_{XY}(4,-2) = 1/6$.

(c) To check for dependence, it is sufficient to find one point where the PMF (or CDF)
does not factor. Consider then $P_{XY}(2,0) = 1/6$, but $P_X(2)P_Y(0) = 1/6 \times 4/6 = 1/9$, so the
random variables X and Y are not independent.

Probabilities associated with Example 2.6-9. The numbers in parentheses are the
probabilities of reaching those points. For example, $P_{XY}(0,0) = 1/2$.
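
The bookkeeping in Example 2.6-9 can be reproduced mechanically. The sketch below, with
function names of our own, builds the joint PMF by looping over the six die faces, then
evaluates the three requested CDF values and the independence check with exact fractions.

```python
from fractions import Fraction
from collections import defaultdict

def x_rv(z):
    """X = 1 + z for faces 1 and 3, zero otherwise."""
    return 1 + z if z in (1, 3) else 0

def y_rv(z):
    """Y = 1 - z for faces 1, 2, and 3, zero otherwise."""
    return 1 - z if z in (1, 2, 3) else 0

# Joint PMF: each die face has probability 1/6.
joint = defaultdict(Fraction)
for face in range(1, 7):
    joint[(x_rv(face), y_rv(face))] += Fraction(1, 6)

def marginal(axis):
    """Marginal PMF along axis 0 (X) or axis 1 (Y)."""
    out = defaultdict(Fraction)
    for point, p in joint.items():
        out[point[axis]] += p
    return out

def joint_cdf(x, y):
    """F_XY(x, y): total mass at points (i, j) with i <= x and j <= y."""
    return sum((p for (i, j), p in joint.items() if i <= x and j <= y), Fraction(0))

px, py = marginal(0), marginal(1)
print(joint_cdf(1, 1), joint_cdf(3, -0.5), joint_cdf(5, -1.5))  # 2/3 1/6 1/6
print(joint[(2, 0)], px[2] * py[0])                              # 1/6 1/9
```

The second line of output shows the single counterexample used in part (c): the joint mass at
(2, 0) is 1/6, whereas the product of the marginals is 1/9.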


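Example 2.6-10
(joint pdf of independent Gaussians)

$$
\begin{aligned}
f_{XY}(x,y) &= \frac{1}{2\pi\sigma^2}\,e^{-(1/2\sigma^2)(x^2+y^2)} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/2\sigma^2} \times \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-y^2/2\sigma^2}.
\end{aligned}
\tag{2.6-62}
$$

Hence X and Y are independent RVs.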


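Example 2.6-11
(calculations with independent Gaussians) The joint pdf of two random variables is given
by $f_{XY}(x,y) = [2\pi]^{-1}\exp[-\tfrac{1}{2}(x^2 + y^2)]$ for $-\infty < x, y < \infty$. Compute the probability that
both X and Y are restricted to (a) the $2 \times 2$ square $|x| \le 1$, $|y| \le 1$; and (b) the unit circle.

Solution (a) Let $\Re_1$ denote the surface of the square. Then

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_1] &= \iint_{\Re_1} f_{XY}(x,y)\,dx\,dy && (2.6\text{-}63) \\
&= \frac{1}{\sqrt{2\pi}}\int_{-1}^{1}\exp\left[-\tfrac{1}{2}x^2\right]dx \times \frac{1}{\sqrt{2\pi}}\int_{-1}^{1}\exp\left[-\tfrac{1}{2}y^2\right]dy && (2.6\text{-}64) \\
&= 2\,\mathrm{erf}(1) \times 2\,\mathrm{erf}(1) \approx 0.466. && (2.6\text{-}65)
\end{aligned}
$$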


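(b) Let $\Re_2$ denote the surface of the unit circle. Then

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_2] &= \iint_{\Re_2} f_{XY}(x,y)\,dx\,dy && (2.6\text{-}66) \\
&= \iint_{x^2+y^2 \le 1} [2\pi]^{-1}\exp\left[-\tfrac{1}{2}(x^2+y^2)\right]dx\,dy. && (2.6\text{-}67)
\end{aligned}
$$

With the substitution $r \triangleq \sqrt{x^2+y^2}$ and $\tan\theta = y/x$, the infinitesimal area $dx\,dy \to r\,dr\,d\theta$,
and we obtain

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_2] &= \frac{1}{2\pi}\int_0^{2\pi}\!\int_0^1 \exp\left(-\tfrac{1}{2}r^2\right) r\,dr\,d\theta && (2.6\text{-}68) \\
&= \frac{1}{2\pi}\int_0^{2\pi} d\theta \int_0^1 r\exp\left(-\tfrac{1}{2}r^2\right)dr && (2.6\text{-}69) \\
&= \int_0^1 r\exp\left(-\tfrac{1}{2}r^2\right)dr && (2.6\text{-}70) \\
&= \int_0^{1/2} e^{-z}\,dz, \quad \text{with } z \triangleq \tfrac{1}{2}r^2, && (2.6\text{-}71) \\
&= 1 - e^{-1/2} \approx 0.393. && (2.6\text{-}72)
\end{aligned}
$$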
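
Both parts of Example 2.6-11 reduce to one-line formulas that are easy to evaluate. In the
sketch below, note one assumption about notation: Python's `math.erf` is the standard error
function, for which $P[|X| \le a] = \mathrm{erf}(a/\sqrt{2})$ when X is standard Gaussian, whereas the
text's $2\,\mathrm{erf}(1)$ appears to use a differently normalized erf. The function names are ours.

```python
import math

def prob_square(half_width=1.0):
    """P[|X| <= a, |Y| <= a] for independent standard Gaussians."""
    # With the standard error function, P[|X| <= a] = erf(a / sqrt(2)).
    p_one = math.erf(half_width / math.sqrt(2.0))
    return p_one * p_one

def prob_disk(radius=1.0):
    """P[X^2 + Y^2 <= R^2] = 1 - exp(-R^2 / 2), from the polar-coordinate integral."""
    return 1.0 - math.exp(-0.5 * radius * radius)

print(round(prob_square(), 3), round(prob_disk(), 3))
```

The printed values agree with the example to within table rounding. The disk probability is
smaller than the square probability, as it must be, since the unit circle fits inside the square.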


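Joint densities involving nonindependent RVs. Lest the reader think that all joint
CDF's or pdf's factor, we next consider a case involving nonindependent random variables.

Example 2.6-12
(computing joint CDF) Consider the simple but nonfactorable joint pdf

$$
f_{XY}(x,y) = \begin{cases} A(x+y), & 0 < x \le 1,\ 0 < y \le 1, \\ 0, & \text{otherwise,} \end{cases}
$$

and answer the following questions.

(i) What is A? We know that

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{XY}(x,y)\,dx\,dy = 1.$$

Hence

$$A\int_0^1 dy \int_0^1 x\,dx + A\int_0^1 dx \int_0^1 y\,dy = 1 \ \Rightarrow\ A = 1.$$

(ii) What are the marginal pdf's?

$$
\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} f_{XY}(x,y)\,dy = \int_0^1 (x+y)\,dy = \left(xy + y^2/2\right)\Big|_{y=0}^{y=1} \\
&= \begin{cases} x + \tfrac{1}{2}, & 0 < x \le 1, \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}
\tag{2.6-73}
$$

Similarly,

$$
f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dx = \begin{cases} y + \tfrac{1}{2}, & 0 < y \le 1, \\ 0, & \text{otherwise.} \end{cases}
\tag{2.6-74}
$$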
(iii) What is $F_{XY}(x,y)$? $F_{XY}(x,y) \triangleq P[X \le x, Y \le y]$, so we must integrate over the
infinite rectangle with vertices $(x,y)$, $(x,-\infty)$, $(-\infty,-\infty)$, and $(-\infty,y)$. However,
only where this rectangle actually overlaps with the region over which $f_{XY}(x,y) \ne
0$, that is, the support of the pdf, written supp($f$), will there be a contribution to
the integral

$$F_{XY}(x,y) = \int_{-\infty}^{x} dx' \int_{-\infty}^{y} dy'\, f_{XY}(x',y').$$

(a) $x \ge 1$, $y \ge 1$ [Figure 2.6-9(a)]

$$F_{XY}(x,y) = \int_0^1\!\int_0^1 f_{XY}(x',y')\,dx'\,dy' = 1.$$

(b) $0 < x \le 1$, $y \ge 1$ [Figure 2.6-9(b)]

$$F_{XY}(x,y) = \int_{y'=0}^{1} dy' \left(\int_{x'=0}^{x} dx'\,(x'+y')\right) = \frac{x}{2}(x+1).$$

(c) $0 < y \le 1$, $x \ge 1$ [Figure 2.6-9(c)]

$$F_{XY}(x,y) = \int_{y'=0}^{y}\!\int_{x'=0}^{1} (x'+y')\,dx'\,dy' = \frac{y}{2}(y+1).$$

(d) $0 < x \le 1$, $0 < y \le 1$ [Figure 2.6-9(d)]

$$F_{XY}(x,y) = \int_{y'=0}^{y}\!\int_{x'=0}^{x} (x'+y')\,dx'\,dy' = \frac{yx}{2}(x+y).$$

(e) $x < 0$, for any $y$; or $y < 0$, for any $x$ [Figure 2.6-9(e)]

$$F_{XY}(x,y) = 0.$$

(iv) Compute $P[X + Y \le 1]$. The point set is the half-space separated by the line
$x + y = 1$ or $y = 1 - x$. However, only where this half-space intersects the region
over which $f_{XY}(x,y) \ne 0$ will there be a contribution to the integral

$$P[X + Y \le 1] = \iint_{x'+y' \le 1} f_{XY}(x',y')\,dx'\,dy'.$$

[See Figure 2.6-9(f).] Hence

$$P[X + Y \le 1] = \int_{x'=0}^{1}\!\int_{y'=0}^{1-x'} (x'+y')\,dy'\,dx' = \frac{1}{3}.$$

Figure 2.6-9 Shaded region in (a) to (e) is the intersection of supp($f_{XY}$) with the point set
associated with the event $\{-\infty < X \le x,\ -\infty < Y \le y\}$. In (f), the shaded region is the
intersection of supp($f_{XY}$) with $\{X + Y \le 1\}$.

In the previous example we dealt with a pdf that was not factorable. Another example
of a joint pdf that is not factorable is

$$
f_{XY}(x,y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2\sigma^2(1-\rho^2)}\left(x^2 + y^2 - 2\rho xy\right)\right)
\tag{2.6-75}
$$

when $\rho \ne 0$. In this case, X and Y are not independent.

In the special case when $\rho = 0$ in Equation 2.6-75, $f_{XY}(x,y)$ factors as $f_X(x)f_Y(y)$
and X and Y become independent random variables. A picture of $f_{XY}(x,y)$ under these
circumstances is shown in Figure 2.6-10 for $\sigma = 1$.

Figure 2.6-10 Graph of the joint Gaussian density $f_{XY}(x,y) = (2\pi)^{-1}\exp\left[-\tfrac{1}{2}(x^2+y^2)\right]$.


As we shall see in Chapter 4, Equation 2.6-75 is a special case of the jointly Gaussian 
probability density of two RVs. We defer a fuller discussion of this important pdf until 
we discuss the meaning of the parameter p. This we do in Chapter 4. We shall see in 
Chapter 5 that Equation 2.6-75 and its generalization can be written compactly in matrix 
form. 


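Example 2.6-13
(calculation with dependent Gaussian RVs) Consider again the problem considered in
Example 2.6-11, part (b), except let

$$f_{XY}(x,y) = \left[2\pi\sqrt{1-\rho^2}\right]^{-1}\exp\left(-\frac{1}{2(1-\rho^2)}\left(x^2 + y^2 - 2\rho xy\right)\right).$$

As before let $\Re_2$ denote the surface of the unit circle. Then

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_2] &= \iint_{\Re_2} f_{XY}(x,y)\,dx\,dy && (2.6\text{-}76) \\
&= \iint_{\Re_2} \left[2\pi\sqrt{1-\rho^2}\right]^{-1}\exp\left(-\frac{1}{2(1-\rho^2)}\left(x^2 + y^2 - 2\rho xy\right)\right)dx\,dy. && (2.6\text{-}77)
\end{aligned}
$$

With the polar coordinate substitution $r \triangleq \sqrt{x^2+y^2}$, $\tan\theta = y/x$, we obtain

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_2]
&= \frac{1}{2\pi\sqrt{1-\rho^2}}\int_0^{2\pi}\!\int_0^1 r\exp\left(-\frac{1}{2(1-\rho^2)}\left(r^2 - 2\rho r^2\cos\theta\sin\theta\right)\right)dr\,d\theta \\
&= \frac{K}{\pi}\int_0^{2\pi}\!\int_0^1 r\exp\left(-r^2\left[2K^2(1-\rho\sin 2\theta)\right]\right)dr\,d\theta, \\
&\qquad \text{with } K \triangleq \frac{1}{2\sqrt{1-\rho^2}} \text{ and } \sin 2\theta = 2\sin\theta\cos\theta, \\
&= \frac{K}{2\pi}\int_0^{2\pi}\left[\int_0^1 \exp\left(-\left[2K^2(1-\rho\sin 2\theta)\right]z\right)dz\right]d\theta, \quad \text{with } z \triangleq r^2, \\
&= \frac{K}{2\pi}\int_0^{2\pi}\frac{1 - \exp\left[-2K^2(1-\rho\sin 2\theta)\right]}{2K^2(1-\rho\sin 2\theta)}\,d\theta \\
&= \frac{1}{4\pi K}\int_0^{2\pi}\frac{1 - \exp\left[-2K^2(1-\rho\sin 2\theta)\right]}{1-\rho\sin 2\theta}\,d\theta.
\end{aligned}
$$

For $\rho = 0$, we get the probability 0.393, that is, the same as in Example 2.6-11, part (b). However,
when $\rho \ne 0$, this probability must be computed numerically since this integral is not available
in closed form. A MATLAB .m file that enables the computation of $P[\zeta : (X,Y) \in \Re_2]$
is furnished below. The result is shown in Figure 2.6-11.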


MATLAB .m file for computing $P[\zeta : (X,Y) \in \Re_2]$

function [Pr]=corrprob
% Probability that two correlated Gaussian RVs fall in the unit circle,
% evaluated for correlation coefficients p = 0, 0.01, ..., 0.99.
p=[0:100]/100.;
q=p(1:100)*2*pi;            % 100 angles covering one period [0, 2*pi)
Pr=zeros(1,100);
K=.5./sqrt(1-p.^2);         % K(101)=Inf at p=1 is never used
for i=1:100
    f=(1-exp(-2*K(i)^2*(1-p(i)*sin(2*q))))./(1-p(i)*sin(2*q));
    Pr(i)=sum(f)/(4*pi)/K(i)*(2*pi/100);   % Riemann sum over the angle
end
plot(p(1:100),Pr)
title('Probability that two correlated Gaussian RVs take values in the unit circle')
xlabel('Correlation coefficient rho')
ylabel('Probability that (X,Y) lie in the unit circle')

Figure 2.6-11 Result of MATLAB computation in Example 2.6-13: probability that (X, Y) lie in
the unit circle (vertical axis) versus correlation coefficient $\rho$ (horizontal axis).


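In Section 4.3 of Chapter 4 we demonstrate the fact that as $\rho \to 1$

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2}\,\delta(y - x).$$

Hence

$$
\begin{aligned}
P[\zeta : (X,Y) \in \Re_2] &= \iint_{x^2+y^2 \le 1} \frac{1}{\sqrt{2\pi}}\,e^{-0.5x^2}\,\delta(x - y)\,dx\,dy && (2.6\text{-}78) \\
&= \int_{-0.707}^{0.707} \frac{1}{\sqrt{2\pi}}\,e^{-0.5x^2}\,dx && (2.6\text{-}79) \\
&= 0.52. && (2.6\text{-}80)
\end{aligned}
$$

This is the result that we observe in Figure 2.6-11.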
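
For readers without MATLAB, the single integral over the angle can be evaluated with a few
lines of Python. This is a sketch under our own choices of function name, grid size, and test
values of $\rho$; it uses a midpoint rule with a fine grid because the integrand becomes sharply
peaked as $\rho$ approaches 1.

```python
import math

def prob_unit_circle(rho, n=20000):
    """Midpoint-rule evaluation of the angular integral, valid for |rho| < 1."""
    k = 0.5 / math.sqrt(1.0 - rho * rho)
    h = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        s = 1.0 - rho * math.sin(2.0 * (i + 0.5) * h)
        total += (1.0 - math.exp(-2.0 * k * k * s)) / s
    return total * h / (4.0 * math.pi * k)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(prob_unit_circle(rho), 4))

# Limiting value as rho -> 1: P[|X| <= 1/sqrt(2)] for a standard Gaussian X,
# written with the standard error function as erf(0.5).
print(round(math.erf(0.5), 4))
```

At $\rho = 0$ the routine reproduces $1 - e^{-1/2}$, and for $\rho$ close to 1 the computed value
approaches the limiting probability of about 0.52.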


Conditional densities. We shall now derive a useful formula for conditional densities
involving two RVs. The formula is based on the definition of conditional probability given
in Equation 1.6-2. From Equation 2.6-39 we obtain

$$
\begin{aligned}
& P[x < X \le x+\Delta x,\ y < Y \le y+\Delta y] \\
&\quad = F_{XY}(x+\Delta x,\, y+\Delta y) - F_{XY}(x,\, y+\Delta y) - F_{XY}(x+\Delta x,\, y) + F_{XY}(x,y).
\end{aligned}
\tag{2.6-81}
$$

Now dividing both sides by $\Delta x\,\Delta y$, taking limits, and subsequently recognizing that the
right-hand side, by definition, is the second partial derivative of $F_{XY}$ with respect to $x$ and
$y$ enables us to write that

$$
\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{P[x < X \le x+\Delta x,\ y < Y \le y+\Delta y]}{\Delta x\,\Delta y}
= \frac{\partial^2 F_{XY}}{\partial x\,\partial y} \triangleq f_{XY}(x,y).
$$

Hence for $\Delta x$, $\Delta y$ small

$$P[x < X \le x+\Delta x,\ y < Y \le y+\Delta y] \simeq f_{XY}(x,y)\,\Delta x\,\Delta y, \tag{2.6-82}$$
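which is the two-dimensional equivalent of Equation 2.4-6. Now consider

$$
\begin{aligned}
P[y < Y \le y+\Delta y \mid x < X \le x+\Delta x]
&= \frac{P[x < X \le x+\Delta x,\ y < Y \le y+\Delta y]}{P[x < X \le x+\Delta x]} \\
&\simeq \frac{f_{XY}(x,y)\,\Delta x\,\Delta y}{f_X(x)\,\Delta x}.
\end{aligned}
\tag{2.6-83}
$$

But the quantity on the left is merely

$$F_{Y|B}(y+\Delta y \mid x < X \le x+\Delta x) - F_{Y|B}(y \mid x < X \le x+\Delta x),$$

where $B \triangleq \{x < X \le x+\Delta x\}$. Hence

$$
\begin{aligned}
\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{F_{Y|B}(y+\Delta y \mid x < X \le x+\Delta x) - F_{Y|B}(y \mid x < X \le x+\Delta x)}{\Delta y}
&= \frac{f_{XY}(x,y)}{f_X(x)} \\
&= \frac{\partial F_{Y|X}(y \mid X = x)}{\partial y} \\
&= f_{Y|X}(y|x)
\end{aligned}
\tag{2.6-84}
$$

by Equation 2.6-3. The notation $f_{Y|X}(y|x)$ reminds us that it is the conditional pdf of Y
given the event $\{X = x\}$. We thus obtain the important formula

$$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}, \qquad f_X(x) \ne 0. \tag{2.6-85}$$

If we use Equation 2.6-85 in Equation 2.6-48 we obtain the useful formula:

$$f_Y(y) = \int_{-\infty}^{\infty} f_{Y|X}(y|x)\,f_X(x)\,dx. \tag{2.6-86}$$

Also

$$f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)}, \qquad f_Y(y) \ne 0. \tag{2.6-87}$$

The quantity $f_{X|Y}(x|y)$ is called the conditional pdf of X given the event $\{Y = y\}$. From
Equations 2.6-85 and 2.6-87 it follows that

$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\,f_X(x)}{f_Y(y)}. \tag{2.6-88}$$

We illustrate with an example.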


Example 2.6-14
(laser coherence) Suppose we observe the light field $U(t)$ being emitted from a laser. Laser
light is said to be temporally coherent, which means that the light at any two times $t_1$ and
$t_2$ is statistically dependent if $t_2 - t_1$ is not too large [2-5]. Let $X \triangleq U(t_1)$, $Y \triangleq U(t_2)$ and
$t_2 > t_1$. Suppose X and Y are modeled as jointly Gaussian† as in Equation 2.6-75 with
$\sigma^2 = 1$. For $\rho \ne 0$, they are dependent. It turns out that, using the defining Equations 2.6-47
and 2.6-48, one can show the marginal densities $f_X(x)$ and $f_Y(y)$ are individually Gaussian.
We defer the proof of this to Chapter 4. Since the means are both zero here and the variances
are both one, we get for the marginal densities

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2} \quad \text{and} \quad f_Y(y) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}y^2};$$

both are centered about zero. Now suppose that we measure the light at $t_1$, that is, X, and
find that $X = x$. Is the pdf of Y, conditioned upon this new knowledge, still centered at
zero, that is, is the average‡ value of Y still zero?

Solution We wish to compute $f_{Y|X}(y|x)$.
Applying Equation 2.6-85,

$$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}$$

yields

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1-\rho^2)}}\exp\left\{-\left[\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\right] + \frac{1}{2}x^2\right\}.$$

†Light is often modeled by the Poisson distribution due to its photon nature. As seen in Chapter 1, for a
large photon count, the Gaussian distribution well approximates the Poisson distribution. Of course light
intensity cannot be negative, but if the mean is large compared to the standard deviation ($\mu \gg \sigma$), then
the Gaussian density will be very small there.

‡A concept to be fully developed in Chapter 4.

If we multiply and divide the isolated $\tfrac{1}{2}x^2$ term in the far right of the exponent by $1 - \rho^2$,
we simplify the above as

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1-\rho^2)}}\exp\left\{-\left[\frac{\rho^2x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right]\right\}.$$

Further simplifications result when quadratic terms in the exponent are combined into a
perfect square:

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1-\rho^2)}}\exp\left(-\frac{(y-\rho x)^2}{2(1-\rho^2)}\right).$$

Thus, when $X = x$, the pdf of Y is centered at $y = \rho x$ and not zero as previously. If $\rho x > 0$,
Y is more likely to take on positive values, and if $\rho x < 0$, Y is more likely to take on negative
values. This is in contrast to what happens when X is not observed: The most likely value
of Y is then zero!
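
The algebra leading to the shifted Gaussian can be confirmed numerically by forming the ratio
of joint to marginal density and comparing it with a Gaussian pdf of mean $\rho x$ and variance
$1 - \rho^2$. The sketch below does this for a few points; the particular values of `rho` and `x`
and the function names are illustrative assumptions.

```python
import math

def joint_pdf(x, y, rho):
    """Zero-mean, unit-variance jointly Gaussian pdf with correlation rho (|rho| < 1)."""
    c = 1.0 - rho * rho
    expo = -(x * x + y * y - 2.0 * rho * x * y) / (2.0 * c)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(c))

def marginal_pdf(x):
    """Standard Gaussian marginal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def conditional_pdf(y, x, rho):
    """f_{Y|X}(y|x) computed directly as the ratio joint / marginal."""
    return joint_pdf(x, y, rho) / marginal_pdf(x)

def gaussian_pdf(y, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

rho, x = 0.8, 1.5   # illustrative values, not taken from the text
for y in (-1.0, 0.0, 1.2, 2.0):
    print(y, conditional_pdf(y, x, rho), gaussian_pdf(y, rho * x, 1.0 - rho * rho))
```

The two columns coincide, and the conditional density peaks at $y = \rho x = 1.2$ rather than at
zero, which is the point of the example.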


A major application of conditioned events and conditional probabilities occurs in the 
science of estimating failure rates. This is discussed in the next section. 


2.7 FAILURE RATES 


In modern industrialized society where planning for equipment replacement, issuance of life 
insurance, and so on are important activities, there is a need to keep careful records of 
the failure rates of objects, be they machines or humans. For example consider the cost of 
life insurance: Clearly it wouldn’t make much economic sense to price a five-year term-life 
insurance policy for a 25-year-old woman at the same level as, say, for a 75-year-old man. 
The “failure” probability (i.e., death) for the older man is much higher than for the young 
woman. Hence, sound pricing policy will require the insurance company to insure the older 
man at a higher price. How much higher? This is determined from actuarial tables which 
are estimates of life expectancy conditioned on many factors. One important condition is 
“that you have survived until (a certain age).” In other words, the probability that you 
will survive to age 86, given that you have survived to age 85, is much higher than the 
probability that you will survive to age 86 if you are an infant. 

Let X denote the time of failure or, equivalently, the failure time. Then, by the definition of
conditional probability, the probability that failure will occur in the interval $[t, t+dt]$ given
that the object has survived to $t$ can be written as

$$P[t < X \le t+dt \mid X > t] = \frac{P[t < X \le t+dt,\ X > t]}{P[X > t]}. \tag{2.7-1}$$

But since the event $\{t < X \le t+dt\}$ is contained in the event $\{X > t\}$, it follows that
$P[t < X \le t+dt,\ X > t] = P[t < X \le t+dt]$. Hence

$$P[t < X \le t+dt \mid X > t] = \frac{P[t < X \le t+dt]}{P[X > t]}. \tag{2.7-2}$$

By recalling that $P[t < X \le t+dt] = F_X(t+dt) - F_X(t)$, we obtain

$$P[t < X \le t+dt \mid X > t] = \frac{F_X(t+dt) - F_X(t)}{1 - F_X(t)}. \tag{2.7-3}$$

A Taylor series expansion of the CDF $F_X(t+dt)$ about the point $t$ yields, to first order in $dt$
(we assume that $F_X$ is differentiable),

$$F_X(t+dt) = F_X(t) + f_X(t)\,dt.$$

When this result is used in Equation 2.7-3, we obtain at last

$$P[t < X \le t+dt \mid X > t] = \frac{f_X(t)\,dt}{1 - F_X(t)} \triangleq \alpha(t)\,dt, \tag{2.7-4}$$

where

$$\alpha(t) \triangleq \frac{f_X(t)}{1 - F_X(t)}. \tag{2.7-5}$$

The object $\alpha(t)$ is called the conditional failure rate although it has other names such
as the hazard rate, force of mortality, intensity rate, instantaneous failure rate, or simply
failure rate. If the conditional failure rate at $t$ is large, then an object surviving to time $t$
will have a higher probability of failure in the next $\Delta t$ seconds than another object with
lower conditional failure rate. Many objects, including humans, have failure rates that vary 
with time. During the early life of the object, failure rates may be high due to inherent or 
congenital defects. After this early period, the object enjoys a useful life characterized by 
a near-constant failure rate. Finally, as the object ages and parts wear out, the failure rate 
increases sharply leading quickly and inexorably to failure or death. 

The pdf of the random variable X can be computed explicitly from Equation 2.7-3
when we observe that $F_X(t+dt) - F_X(t) = F_X'(t)\,dt = dF_X$. Thus, we get

$$\frac{dF_X}{1 - F_X} = \alpha(t)\,dt, \tag{2.7-6}$$

which can be solved by integration. First recall from calculus that

$$\int_{y_0}^{y_1} \frac{dy}{1-y} = -\int_{1-y_0}^{1-y_1} \frac{dy}{y} = \int_{1-y_1}^{1-y_0} \frac{dy}{y} = \ln\frac{1-y_0}{1-y_1}.$$

Second, use the facts that

(i) $F_X(0) = 0$ since we assume that the object is working at $t = 0$ (the time that the
object is turned on);
(ii) $F_X(\infty) = 1$ since we assume that the object must ultimately fail. Then

$$\int_{F_X(0)}^{F_X(t)} \frac{dF_X}{1 - F_X} = -\ln\left[1 - F_X(t)\right] = \int_0^t \alpha(t')\,dt',$$

from which we finally obtain

$$F_X(t) = 1 - \exp\left[-\int_0^t \alpha(t')\,dt'\right]. \tag{2.7-7}$$

Since $F_X(\infty) = 1$ we must have

$$\int_0^\infty \alpha(t')\,dt' = \infty. \tag{2.7-8}$$

Equation 2.7-7 is the CDF for the failure time X. By differentiating Equation 2.7-7, we
obtain the pdf

$$f_X(t) = \alpha(t)\exp\left[-\int_0^t \alpha(t')\,dt'\right]. \tag{2.7-9}$$

Different pdf's result from different models for the conditional failure rate $\alpha(t)$.
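
To see how a failure-rate model turns into a distribution, the following sketch evaluates
Equations 2.7-7 and 2.7-9 numerically for an arbitrary hazard function and compares the
result with closed forms for two simple models: a constant hazard and a linearly increasing
hazard. The function names and the numerical constants are illustrative assumptions.

```python
import math

def failure_cdf(alpha, t, n=10000):
    """F_X(t) = 1 - exp(-integral from 0 to t of alpha), integral by the midpoint rule."""
    h = t / n
    integral = sum(alpha((i + 0.5) * h) for i in range(n)) * h
    return 1.0 - math.exp(-integral)

def failure_pdf(alpha, t, n=10000):
    """f_X(t) = alpha(t) * (1 - F_X(t))."""
    return alpha(t) * (1.0 - failure_cdf(alpha, t, n))

def constant_hazard(t, lam=2.0):
    """Constant failure rate (assumed value), which gives the exponential law."""
    return lam

def linear_hazard(t, c=0.5):
    """Failure rate growing linearly with age (assumed slope)."""
    return c * t

t = 1.3
print(failure_cdf(constant_hazard, t), 1.0 - math.exp(-2.0 * t))
print(failure_cdf(linear_hazard, t), 1.0 - math.exp(-0.25 * t * t))
print(failure_pdf(linear_hazard, t), 0.5 * t * math.exp(-0.25 * t * t))
```

Each printed pair should agree closely. Swapping in another nonnegative hazard function whose
integral diverges gives another valid failure-time distribution.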


Example 2.7-1
(conditional failure rate for the exponential case) Assume that X obeys the exponential
probability law, that is, $F_X(t) = (1 - e^{-\lambda t})u(t)$. We find

$$\alpha(t) = \frac{f_X(t)}{1 - F_X(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda, \qquad t \ge 0.$$

Thus, the conditional failure rate is a constant. Conversely, if $\alpha(t)$ is a constant, the failure
time obeys the exponential probability law.


An important point to observe is that the conditional failure rate is not a pdf (see 
Equation 2.7-8). The conditional density of X, given {X > t}, can be computed from the 
conditional distribution by differentiation. For example, 

$$F_X(x|X > t) \triangleq P[X \le x \,|\, X > t] = \frac{P[X \le x, X > t]}{P[X > t]}. \tag{2.7-10}$$

The event $\{X \le x, X > t\}$ is clearly empty if $t \ge x$. If $t < x$, then 
$\{X \le x, X > t\} = \{t < X \le x\}$. Thus, 

$$F_X(x|X > t) = \begin{cases} 0, & t \ge x, \\ \dfrac{F_X(x) - F_X(t)}{1 - F_X(t)}, & x > t. \end{cases} \tag{2.7-11}$$

Hence 

$$f_X(x|X > t) = \begin{cases} 0, & x < t, \\ \dfrac{f_X(x)}{1 - F_X(t)}, & x \ge t. \end{cases} \tag{2.7-12}$$

The connection between $\alpha(t)$ and $f_X(x|X > t)$ is obtained by comparing Equation 2.7-12 
with Equation 2.7-5, that is, 

$$f_X(t|X > t) = \alpha(t). \tag{2.7-13}$$


Example 2.7-2 
(Itsibitsi breakdown) Oscar, a college student, has a nine-year-old Itsibitsi, an import car 
famous for its reliability. The conditional failure rate, based on field data, is $\alpha(t) = 0.06t\,u(t)$ 
(t in years), assuming a normal usage of 10,000 miles/year. To celebrate the end of the school year, Oscar 
begins a 30-day cross-country motor trip. What is the probability that Oscar's Itsibitsi will 
have a first breakdown during his trip? 


Solution First, we compute the pdf $f_X(t)$ as 

$$f_X(t) = 0.06t\, e^{-\int_0^t 0.06 t'\,dt'}\, u(t) \tag{2.7-14}$$
$$= 0.06t\, e^{-0.03t^2}\, u(t). \tag{2.7-15}$$

Next, we convert 30 days into 0.0824 years. Finally, we note that 

$$P[9.0 < X \le 9.0824 \,|\, X > 9] = \frac{P[9.0 < X \le 9.0824]}{1 - F_X(9)},$$

where we have used Bayes' rule and the fact that the event $\{9 < X \le 9.0824\} \cap \{X > 9\} = 
\{9 < X \le 9.0824\}$. 
Since 

$$P[9.0 < X \le 9.0824] = 0.06 \int_{9.0}^{9.0824} t\, e^{-0.03t^2}\,dt \tag{2.7-16}$$
$$= \frac{0.06}{2} \int_{(9.0)^2}^{(9.0824)^2} e^{-0.03z}\,dz, \quad \text{with } z \triangleq t^2, \tag{2.7-17}$$
$$= \frac{0.06}{2} \cdot \frac{1}{0.03} \left( e^{-0.03(9.0)^2} - e^{-0.03(9.0824)^2} \right) \tag{2.7-18}$$
$$= e^{-2.43} - e^{-2.475} \tag{2.7-19}$$
$$\approx 0.0038 \tag{2.7-20}$$

and 

$$1 - F_X(9) = 0.088,$$

Oscar's car has a $3.8 \times 10^{-3} / 8.8 \times 10^{-2}$ or 0.043 probability of suffering a breakdown in the 
next 30 days. 

Incidentally, the probability that a newly purchased Itsibitsi will have at least one 
breakdown in ten years is 0.95. 
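
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name `survival` is an assumption, and it uses the closed form $1 - F_X(t) = e^{-0.03t^2}$ that follows from Equation 2.7-7 for $\alpha(t) = 0.06t$.

```python
import math

def survival(t):
    """P[X > t] = exp(-0.03 t^2) for alpha(t) = 0.06 t, t >= 0 (t in years)."""
    return math.exp(-0.03 * t * t)

age = 9.0
trip = 0.0824  # 30 days expressed in years, as in the example

p_interval = survival(age) - survival(age + trip)  # P[9 < X <= 9.0824]
p_survive = survival(age)                          # 1 - F_X(9)
p_breakdown = p_interval / p_survive               # conditional probability
p_ten_years = 1.0 - survival(10.0)                 # breakdown within ten years, new car

print(p_interval, p_survive, p_breakdown, p_ten_years)
```

Carried at full precision the conditional probability comes out slightly under 0.044; the 0.043 quoted above results from dividing the rounded values 0.0038 and 0.088.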


SUMMARY 


The material discussed in this chapter is central to the concept of the whole book. We began 
by defining a real random variable as a mapping from the sample space $\Omega$ to the real line 
R. We then introduced a point function $F_X(x)$ called the cumulative distribution function 
(CDF), which enabled us to compute the probabilities of events of the type $\{\zeta: \zeta \in \Omega, 
X(\zeta) \le x\}$. The probability density function (pdf) and probability mass function (PMF) 
were derived from the CDF, and a number of useful and specific probability laws were 
discussed. We showed how, by using Dirac delta functions, we could develop a unified theory 
for both discrete and continuous random variables. We then discussed joint distributions, 
the Poisson transform and its inverse, and the application of these concepts to physical 
problems. 

We discussed the important concept of conditional probability and illustrated its 
application in the area of conditional failure rates. The conditional failure rate, often high at the 
outset, constant during mid-life, and high at old age, is fundamental in determining the 
probability law of time-to-failure. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 

reading.) 

2.1 The event of k successes in n tries regardless of the order is the binomial law b(k, n; p). 
Let n = 10, p = 0.3. Define the RV X by 

$$X(k) = \begin{cases} 1, & \text{for } 0 \le k \le 2, \\ 2, & \text{for } 2 < k \le 5, \\ 3, & \text{for } 5 < k \le 8, \\ 4, & \text{for } 8 < k \le 10. \end{cases}$$

Compute the probabilities P[X = j] for j = 1, ..., 4. Plot the CDF $F_X(x) = P[X \le x]$ 
for all x. 

*2.2 Consider the probability space $(\Omega, \mathcal{F}, P)$. Give an example, and substantiate it in a 
sentence or two, where all outcomes have probability zero. Hint: Think in terms of 
random variables. 

2.3 In a restaurant known for its unusual service, the time X, in minutes, that a customer 
has to wait before he captures the attention of a waiter is specified by the following 
CDF: 

$$F_X(x) = \begin{cases} \left(\dfrac{x}{2}\right)^2, & \text{for } 0 \le x < 1, \\ \dfrac{x}{4}, & \text{for } 1 \le x < 2, \\ \dfrac{1}{2}, & \text{for } 2 \le x < 10, \\ \dfrac{x}{20}, & \text{for } 10 \le x < 20, \\ 1, & \text{for } x \ge 20. \end{cases}$$

(a) Sketch $F_X(x)$. (b) Compute and sketch the pdf $f_X(x)$. Verify that the area under 
the pdf is indeed unity. (c) What is the probability that the customer will have to 
wait (1) at least 10 minutes, (2) less than 5 minutes, (3) between 5 and 10 minutes, 
(4) exactly 1 minute? 

2.4 Compute the probabilities of the events $\{X < a\}$, $\{X \le a\}$, $\{a \le X < b\}$, 
$\{a \le X \le b\}$, $\{a < X \le b\}$, and $\{a < X < b\}$ in terms of $F_X(x)$ and $P[X = x]$ for x = a, b. 

2.5 In the following pdf's, compute the constant B required for proper normalization: 
Cauchy ($\alpha < \infty$, $\beta > 0$): 

$$f_X(x) = \frac{B}{1 + [(x - \alpha)/\beta]^2}, \quad -\infty < x < \infty.$$

Maxwell ($\alpha > 0$): 

$$f_X(x) = \begin{cases} B x^2 e^{-x^2/\alpha^2}, & x > 0, \\ 0, & \text{otherwise}. \end{cases}$$

*2.6 For these more advanced pdf's, compute the constant B required for proper 
normalization: 
Beta (b > −1, c > −1): 

$$f_X(x) = \begin{cases} B x^b (1 - x)^c, & 0 \le x \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

(See formula 6.2-1 on page 258 of [2-6].) 
Chi-square ($\sigma > 0$, n = 1, 2, ...): 

$$f_X(x) = \begin{cases} B x^{(n/2)-1} \exp\left(-\dfrac{x}{2\sigma^2}\right), & x > 0, \\ 0, & \text{otherwise}. \end{cases}$$

2.7 A noisy resistor produces a voltage $v_n(t)$. At $t = t_1$, the noise level $X \triangleq v_n(t_1)$ is 
known to be a Gaussian RV with pdf 

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right].$$

Compute and plot the probability that $|X| > k\sigma$ for k = 1, 2, .... 


2.8 Compute $F_X(k\sigma)$ for the Rayleigh pdf (Equation 2.4-15) for k = 0, 1, 2, .... 

2.9 Write the probability density functions (using delta functions) for the Bernoulli, 
binomial, and Poisson PMF's. 

2.10 The pdf of a RV X is shown in Figure P2.10. The numbers in parentheses indicate 
area. (a) Compute the value of A; (b) sketch the CDF; (c) compute $P[2 \le X < 3]$; 
(d) compute $P[2 < X \le 3]$; (e) compute $F_X(3)$. 

Figure P2.10 pdf of a Mixed RV. 

2.11 The CDF of a random variable X is given by $F_X(x) = (1 - e^{-x})u(x)$. Find the 
probability of the event $\{\zeta: X(\zeta) < 1 \text{ or } X(\zeta) > 2\}$. 

2.12 The pdf of random variable X is shown in Figure P2.10. The numbers in parentheses 
indicate area. Compute the value of A. Compute P[2 < X < 4]. 

2.13 (two coins tossing) The experiment consists of throwing two indistinguishable coins 
simultaneously. The sample space is $\Omega$ = {two heads, one head, no heads}, which we 
denote abstractly as $\Omega = \{\zeta_1, \zeta_2, \zeta_3\}$. Next, define two random variables as 

$$X(\zeta_1) = 0, \quad X(\zeta_2) = 0, \quad X(\zeta_3) = 1$$
$$Y(\zeta_1) = 1, \quad Y(\zeta_2) = -1, \quad Y(\zeta_3) = 1.$$

(a) Compute all possible joint probabilities of the form $P[\zeta: X(\zeta) = \alpha, Y(\zeta) = \beta]$, 
$\alpha \in \{0, 1\}$, $\beta \in \{-1, 1\}$. 

(b) Determine whether X and Y are independent random variables. 

2.14 The pdf of the random variable X is shown in Figure P2.14. The numbers in 
parentheses indicate the corresponding impulse area. 
So, 


1 1 i Ax?, |x| < 2, 
fx(x) = <6(a +2) + —d(x +1) + —d(2 +f 0, sil 


Note that the density $f_X$ is zero off of [−2, +2]. 

Figure P2.14 pdf of the mixed RV in Problem 2.14. 

(a) Determine the value of the constant A. 
(b) Plot the CDF $F_X(x)$. Please label the significant points on your plot. 
(c) Calculate $F_X(1)$. 
(d) Find P[−1 < X < 2]. 

2.15 Consider a binomial RV with PMF b(k; 4, 1/2). Compute P[X = k | X even] for k = 
0, ..., 4. 

*2.16 Continuing with Example 2.6-8, find the marginal distribution function $F_N(n)$. Find 
and sketch the corresponding PMF $P_N(n)$. Also find the conditional probability 
density function $f_W(w|N = n) = f_{W|N}(w|n)$. (In words, $f_{W|N}(w|n)$ is the pdf of W 
given that N = n.) 

2.17 The time-to-failure in months, X, of light bulbs produced at two manufacturing 
plants A and B obeys, respectively, the following CDFs: 

$$F_X(x) = (1 - e^{-x/5})u(x) \quad \text{for plant A} \tag{2.7-21}$$
$$F_X(x) = (1 - e^{-x/2})u(x) \quad \text{for plant B}. \tag{2.7-22}$$

Plant B produces three times as many bulbs as plant A. The bulbs, indistinguishable 
to the eye, are intermingled and sold. What is the probability that a bulb purchased 
at random will burn at least (a) two months; (b) five months; (c) seven months? 

2.18 Show that the conditioned distribution of X given the event $A = \{b < X \le a\}$ is 

$$F_X(x|A) = \begin{cases} 0, & x \le b, \\ \dfrac{F_X(x) - F_X(b)}{F_X(a) - F_X(b)}, & b < x \le a, \\ 1, & x > a. \end{cases}$$

2.19 It has been found that the number of people Y waiting in a queue in the bank on 
payday obeys the Poisson law as 

$$P[Y = k \,|\, X = x] = e^{-x}\frac{x^k}{k!}, \quad k \ge 0,\ x > 0,$$

given that the normalized serving time of the teller x (i.e., the time it takes the teller 
to deal with a customer) is constant. However, the serving time is more accurately 
modeled as an RV X. For simplicity let X be a uniform RV with 

$$f_X(x) = \tfrac{1}{5}[u(x) - u(x - 5)].$$

Then P[Y = k | X = x] is still Poisson but P[Y = k] is something else. Compute 
P[Y = k] for k = 0, 1, and 2. The answer for general k may be difficult. 

2.20 Suppose in a presidential election each vote has equal probability p = 0.5 of being 
in favor of either of two candidates, candidate 1 and candidate 2. Assume all votes 
are independent. Suppose 8 votes are selected for inspection. Let X be the random 
variable that represents the number of favorable votes for candidate 1 in these 8 
votes. Let A be the event that this number of favorable votes exceeds 4, that is, 
A = {X > 4}. 
(a) What is the PMF for the random variable X? Note that the PMF should be 
symmetric about X = k = 4. 
(b) Find and plot the conditional distribution function $F_X(x|A)$ for the range 
$-1 \le x \le 10$. 
(c) Find and plot the conditional pdf $f_X(x|A)$ for the range $-1 \le x \le 10$. 
(d) Find the conditional probability that the number of favorable votes for 
candidate 1 is between 4 and 5 inclusive, that is, $P[4 \le X \le 5 \,|\, A]$. 

2.21 Random variables X and Y have joint pdf 

$$f_{X,Y}(x, y) = \begin{cases} \frac{3}{4}x^2(1 - y), & 0 \le x \le 2,\ 0 \le y \le 1, \\ 0, & \text{else}. \end{cases}$$

Find: 
(a) $P[X \le 0.5]$. 
(b) $F_Y(0.5)$. 
(c) $P[X \le 0.5 \,|\, Y \le 0.5]$. 
(d) $P[Y \le 0.5 \,|\, X \le 0.5]$. 

2.22 Consider the joint pdf of X and Y: 

$$f_{XY}(x, y) = \frac{1}{3\pi} e^{-\frac{1}{2}[(x/3)^2 + (y/2)^2]} u(x)u(y).$$

Are X and Y independent RVs? Compute the probability of $\{0 < X \le 3, 0 < Y \le 2\}$. 

2.23 Consider the random variable X with pdf $f_X(x)$ given by 

$$f_X(x) = \begin{cases} A(1 + x), & -1 < x \le 0, \\ A(1 - x), & 0 < x < 1, \\ 0, & \text{elsewhere}. \end{cases}$$

(a) Find A and plot $f_X(x)$; 
(b) Plot $F_X(x)$, the CDF; 
(c) Find point b such that 

$$P[X > b] = P[X \le b].$$

2.24 Show that Equation 2.6-75 factors as $f_X(x)f_Y(y)$ when $\rho = 0$. What are $f_X(x)$ and 
$f_Y(y)$? For $\sigma = 1$ and $\rho = 0$, what is $P[-\frac{1}{2} < X \le \frac{1}{2}, -\frac{1}{2} < Y \le \frac{1}{2}]$? 

2.25 Consider a communication channel corrupted by noise. Let X be the value of the 
transmitted signal and Y be the value of the received signal. Assume that the 
conditional density of Y given {X = x} is Gaussian, that is, 

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - x)^2}{2\sigma^2}\right),$$

and X is uniformly distributed on [−1, 1]. What is the conditional pdf of X given 
Y, that is, $f_{X|Y}(x|y)$? 

2.26 Consider a communication channel corrupted by noise. Let X be the value of the 
transmitted signal and Y the value of the received signal. Assume that the 
conditional density of Y given X is Gaussian, that is, 

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - x)^2}{2\sigma^2}\right),$$

and that X takes on only the values +1 and −1 equally likely. What is the conditional 
density of X given Y, that is, $f_{X|Y}(x|y)$? 

2.27 The arrival time of a professor to his office is a continuous RV uniformly distributed 
over the hour between 8 A.M. and 9 A.M. Define the events: 

A = {The prof. has not arrived by 8:30 A.M.}, (2.7-23) 
B = {The prof. will arrive by 8:31 A.M.}. (2.7-24) 

Find 
(a) P[B|A]. 
(b) P[A|B]. 

2.28 Let X be a random variable with pdf 

$$f_X(x) = \begin{cases} 0, & x < 0, \\ c e^{-2x}, & x \ge 0. \end{cases}$$

(a) Find c; 
(b) Let a > 0, x > 0, find $P[X \ge x + a]$; 
(c) Let a > 0, x > 0, find $P[X \ge x + a \,|\, X \ge a]$. 

2.29 To celebrate getting a passing grade in a course on probability, Wynette invites her 
Professor, Dr. Chance, to dinner at the famous French restaurant C'est Tres Chere. 
The probability of getting a reservation if you call y days in advance is given by 
$1 - e^{-y}$, where $y \ge 0$. What is the minimum number of days that Wynette should 
call in advance in order to have a probability of at least 0.95 of getting a reservation? 

2.30 A U.S. defense radar scans the skies for unidentified flying objects (UFOs). Let M 
be the event that a UFO is present and $M^c$ the event that a UFO is absent. Let 
$f_{X|M}(x|M) = \frac{1}{\sqrt{2\pi}} \exp(-0.5[x - r]^2)$ be the conditional pdf of the radar return 
signal X when a UFO is actually there, and let $f_{X|M^c}(x|M^c) = \frac{1}{\sqrt{2\pi}} \exp(-0.5[x]^2)$ 
be the conditional pdf of the radar return signal X when there is no UFO. To be 
specific, let r = 1 and let the alert level be $x_A = 0.5$. Let A denote the event of an 
alert, that is, $\{X > x_A\}$. Compute $P[A|M]$, $P[A^c|M]$, $P[A|M^c]$, $P[A^c|M^c]$. 

2.31 In the previous problem assume that $P[M] = 10^{-3}$. Compute 

$P[M|A]$, $P[M|A^c]$, $P[M^c|A]$, $P[M^c|A^c]$. Repeat for $P[M] = 10^{-6}$. 


Note: By assigning drastically different numbers to P[M], this problem attempts to 
illustrate the difficulty of using probability in some types of problems. Because a 
UFO appearance is so rare (except in Roswell, New Mexico), it may be considered a 
one-time event for which accurate knowledge of the prior probability P[M] is near 
impossible. Thus, in the surprise attack by the Japanese on Pearl Harbor in 1941, 
while the radar clearly indicated a massive cloud of incoming objects, the signals 
were ignored by the commanding officer (CO). Possibly the CO assumed that the 
prior probability of an attack was so small that a radar failure was more likely. 

2.32 (research problem: receiver-operating characteristics) In the preceding radar problem, 
$P[A|M^c]$ is known as $\alpha$, the probability of a false alarm, while $P[A|M]$ is known as $\beta$, 
the probability of a correct detection. Clearly $\alpha = \alpha(x_A)$, $\beta = \beta(x_A)$. Write a MATLAB 
program to plot $\beta$ versus $\alpha$ for a fixed value of r. Choose r = 0, 1, 2, 3. The curves so obtained 
are known among radar people as the receiver-operating characteristic (ROC) for 
various values of r. 

2.33 A sophisticated house security system uses an infrared beam to complete a circuit. If 
the circuit is broken, say by a robber crossing the beam, a bell goes off. The way the 
system works is as follows: The photodiode generates a beam of infrared photons at 
a Poisson rate of $9 \times 10^6$ photons per second. Every microsecond a counter counts the 
total number of photons collected at the detector. If the count drops below 2 photons 
in the counting interval ($10^{-6}$ seconds), it is assumed that the circuit is broken and 
the bell rings. Assuming the Poisson PMF, compute the probability of a false alarm 
during a one-second interval. 

2.34 A traffic light can be in one of three states: green (G), red (R), and yellow (Y). 
The light changes in a random fashion (e.g., the light at the corner of Hoosick and 
Doremus in Nirvana, New York). At any one time the light can be in only one state. 
The experiment consists of observing the state of the light. 

(a) Give the sample space of this experiment and list five events. 
(b) Let a random variable X(·) be defined as follows: X(G) = −1; X(R) = 0; 
X(Y) = $\pi$. Assume that P[G] = P[Y] = 0.5 × P[R]. Plot the pdf of X. What 
is $P[X \le 3]$? 

2.35 A token-based, multi-user communication system works as follows: say that nine 
user-stations are connected to a ring and an electronic signal, called a token, is passed 
around the ring in, say, a clockwise direction. The token stops at each station and 
allows the user (if there is one) up to five minutes of signaling a message. The token 
waits for a maximum of one minute at each station for a user to initiate a message. 
If no user appears at the station at the end of the minute, the token is passed on 
to the next station. The five-minute window includes the waiting time of the token 
at the station. Thus, a user who begins signaling at the end of the token waiting 
period has only four minutes of signaling left. 


(a) Assume that you are a user at a station. What are the minimum and maximum 
waiting times you might experience? The token is assumed to travel instan- 
taneously from station to station. 

(b) Let the probability that a station is occupied be p. If a station is occupied, 
the “occupation time” is a random variable that is uniformly distributed in 
(0,5) minutes. Using MATLAB, write a program that simulates the waiting 
time at your station. Assume that the token has just left your station. Pick 
various values of p. 


2.36 Let X and Y be jointly Gaussian RVs with pdf 

$$f_{XY}(x, y) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}(x^2 + y^2)\right].$$

What is the smallest value of c such that $P[X^2 + Y^2 \le c^2] \ge 0.95$? Hint: Use polar 
coordinates. 


2.37 We are given the following joint pdf for random variables X and Y: 

$$f_{X,Y}(x, y) = \begin{cases} A, & 0 \le |x| + |y| < 1, \\ 0, & \text{otherwise}. \end{cases}$$

(a) What is the value of the constant A? 
(b) What is the marginal density $f_X(x)$? 
(c) Are X and Y independent? Why? 
(d) What is the conditional density $f_{Y|X}(y|x)$? 

Figure P2.37 Support of $f_{XY}(x, y)$ in Problem 2.37. 

2.38 A laser used to scan the bar code on supermarket items is assumed to have a constant 
conditional failure rate $\lambda$ (> 0). What is the maximum value of $\lambda$ that will yield a 
probability of a first breakdown in 100 hours of operation less than or equal to 0.05? 

2.39 Compute the pdf of the failure time X if the conditional failure rate $\alpha(t)$ is as shown 
in Figure P2.39. 

Figure P2.39 Failure rate $\alpha(t)$ in Problem 2.39. 

2.40 Use the basic properties of the joint CDF $F_{XY}(x, y)$ to show 
(a) $f_{XY}(x, y)\,dx\,dy = P[x < X \le x + dx,\ y < Y \le y + dy]$; 
(b) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$; and 
(c) $f_{XY}(x, y) \ge 0$. 


Functions of Random Variables 


3.1 INTRODUCTION 


A classic problem in engineering is the following: We are given the input to a system and 
we must calculate the output. If the input to a system is random, the output will generally 
be random as well. To put this somewhat more formally, if the input at some instant t or 
point x is a random variable (RV), the output at some corresponding instant t′ or point x′ 
will be a random variable. Now the question arises, if we know the CDF, PMF, or pdf of the 
input RV can we compute these functions for the output RV? In many cases we can, while 
in other cases the computation is too difficult and we settle for descriptors of the output 
RV which contain less information than the CDF. Such descriptors are called averages or 
expectations and are discussed in Chapter 4. In general for systems with memory, that is, 
systems in which the output at a particular instant of time depends on past values of the 
input (possibly an infinite number of such past values), it is much more difficult (if not 
impossible) to calculate the CDF of the output. This is the case for random sequences 
and processes to be treated in Chapters 7 and 8. In this chapter, we study much simpler 
situations involving just one or a few random variables. We illustrate with some examples. 


Example 3.1-1 
(power loss in resistor) As is well known from electric circuit theory, the current I flowing 
through a resistor R (Figure 3.1-1) dissipates an amount of power W given by 

$$W(I) = I^2 R. \tag{3.1-1}$$

Figure 3.1-1 Ohmic power dissipation in a resistor. 

Equation 3.1-1 is an explicit rule that generates for every value of I a number W(I). This 
rule or correspondence is called a function and is denoted by W(·) or merely W or 
sometimes even W(I), although the latter notation obscures the difference between the rule 
and the actual number. Clearly, if I were a random variable, the rule $W = I^2R$ generates a 
new random variable W whose CDF might be quite different from that of I.† Indeed, this 
alludes to the heart of the problem: Given a rule g(·), and a random variable X with pdf 
$f_X(x)$, what is the pdf $f_Y(y)$ of the random variable Y = g(X)? 


The computation of $f_Y(y)$, $F_Y(y)$, or the PMF of Y, that is, $P_Y(y_i)$, can be very simple 
or quite complex. We illustrate such a computation with a second example, one that comes 
from communication theory. 


Example 3.1-2 
(waveform detector) A two-level waveform is made analog because of the effect of additive 
Gaussian noise (Figure 3.1-2). A decoder samples the analog waveform x(t) at $t_0$ and decodes 
according to the following rule: 

Input to Decoder x     | Output of Decoder y 
If $x(t_0)$:           | Then y is assigned: 
$\ge \frac{1}{2}$      | 1 
$< \frac{1}{2}$        | 0 

What is the PMF or pdf of Y? 

Solution Clearly with Y (an RV) denoting the output of the decoder, we can write the 
following events: 

$$\{Y = 0\} = \{X < 0.5\} \tag{3.1-2a}$$
$$\{Y = 1\} = \{X \ge 0.5\}, \tag{3.1-2b}$$

where $X \triangleq x(t_0)$. Hence if we assume X: N(1,1), we obtain the following: 

$$P[Y = 0] = P[X < 0.5] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0.5} e^{-\frac{1}{2}(x - 1)^2}\,dx \approx 0.31. \tag{3.1-3}$$

†This is assuming that the composite function $I^2(\zeta)R$ satisfies the required properties of an RV (see 
Section 2.2). 

Figure 3.1-2 Decoding of a noise-corrupted digital pulse by sampling and hard clipping. 

Figure 3.1-3 The area associated with P[Y = 0] in Example 3.1-2. 


In arriving at Equation 3.1-3 we use the normalization procedure explained in Section 2.4 
and the fact that for X: N(0,1) and any x < 0, the CDF $F_X(x) = \frac{1}{2} - \mathrm{erf}(|x|)$. The area 
under the Normal N(1,1) curve associated with P[Y = 0] is shown in Figure 3.1-3. 

In a similar fashion we compute P[Y = 1] = 0.69. Hence the PMF of Y is 

$$P_Y(y) = \begin{cases} 0.31, & y = 0, \\ 0.69, & y = 1, \\ 0, & \text{else}. \end{cases} \tag{3.1-4}$$

Using Dirac delta functions, we can obtain the pdf of Y: 

$$f_Y(y) = 0.31\,\delta(y) + 0.69\,\delta(y - 1). \tag{3.1-5}$$

In terms of the Kronecker delta function, that is, $\delta(y) = 1$ at y = 0 and 0 otherwise, the PMF 
would be 

$$P_Y(y) = 0.31\,\delta(y) + 0.69\,\delta(y - 1).$$

The Kronecker $\delta$ is used in PMFs while the Dirac $\delta$ is used in pdfs. We keep the symbols 
the same although they mean different things. 
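
The numbers in Equations 3.1-3 and 3.1-4 can be reproduced with a short Python sketch. The function names `normal_cdf` and `decoder_pmf` are illustrative assumptions. Note also that Python's `math.erf` is the conventional error function $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$, which is normalized differently from the erf used in the text, so the Normal CDF is written here in terms of `math.erf` directly.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of N(mean, std^2), written with the conventional error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def decoder_pmf(threshold=0.5, mean=1.0, std=1.0):
    """PMF of the hard-clipping decoder output Y for X: N(mean, std^2)."""
    p0 = normal_cdf(threshold, mean, std)  # P[Y = 0] = P[X < threshold]
    return {0: p0, 1: 1.0 - p0}            # P[Y = 1] = P[X >= threshold]

pmf = decoder_pmf()
print(pmf)
```

Changing `threshold`, `mean`, or `std` shows how the two point masses shift as the decision level or the noise level changes.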


Of course not all function-of-a-random-variable (FRV) problems are this easy to evaluate. 
To gain a deeper insight into the FRV problem, we take a closer look at the underlying 
concept of FRV. The gain in insight will be useful when we discuss random sequences and 
processes beginning in Chapter 7. 


Functions of a Random Variable (FRV): Several Views 


There are several different but essentially equivalent views of an FRV. We will now present 
two of them. The differences between them are mainly ones of emphasis. 

Assume as always an underlying probability space $\mathcal{P} = (\Omega, \mathcal{F}, P)$ and a random variable 
X defined on it. Recall that X is a rule that assigns to every $\zeta \in \Omega$ a number $X(\zeta)$. X 
transforms the $\sigma$-field of events $\mathcal{F}$ into the Borel $\sigma$-field $\mathcal{B}$ of sets of numbers on the real 
line. If $R_X$ denotes the subset of the real line actually reached by X as $\zeta$ roams over $\Omega$, then 
we can regard X as an ordinary function with domain $\Omega$ and range $R_X$. Now, additionally, 
consider a measurable real function g(x) of the real variable x. 


First view ($Y: \Omega \to R_Y$). For every $\zeta \in \Omega$, we generate a number $g(X(\zeta)) \triangleq Y(\zeta)$. The 
rule Y, which generates the numbers $\{Y(\zeta)\}$ for random outcomes $\{\zeta \in \Omega\}$, is an RV with 
domain $\Omega$ and range $R_Y \subset R^1$. Finally for every Borel set of real numbers $B_Y$, the set 
$\{\zeta: Y(\zeta) \in B_Y\}$ is an event. In particular the event $\{\zeta: Y(\zeta) \le y\}$ is equal to the event 
$\{\zeta: g(X(\zeta)) \le y\}$. 

In this view, the stress is on Y as a mapping from $\Omega$ to $R_Y$. The intermediate role of 
X is suppressed. 


Second view (input/output systems view). For every value of $X(\zeta)$ in the range $R_X$, 
we generate a new number Y = g(X) whose range is $R_Y$. The rule Y whose domain is $R_X$ 
and range is $R_Y$ is a function of the random variable X. In this view the stress is on viewing 
Y as a mapping from one set of real numbers to another. A model for this view is to regard 
X as the input to a system with transformation function g(·).† For such a system, an input 
x gets transformed to an output y = g(x) and an input function X gets transformed to an 
output function Y = g(X). (See Figure 3.1-4.) 

Figure 3.1-4 Input/output view of a function of a random variable. 

The input-output viewpoint is the one we stress, partly because it is particularly useful 
in dealing with random processes where the input consists of waveforms or sequences of 
random variables. The central problem in computations involving FRVs is: Given g(x) and 
$F_X(x)$, find the point set $C_y$ such that the following events are equal: 

$$\{\zeta: Y(\zeta) \le y\} = \{\zeta: g[X(\zeta)] \le y\} = \{\zeta: X(\zeta) \in C_y\}. \tag{3.1-6a}$$

†g can be any measurable function; that is, if $R_Y$ is the range of Y, then the inverse image (see Section 
2.2) of every subset in $R_Y$ generated by countable unions and intersections of sets of the form $\{Y \le y\}$ is 
an event. 

In general we will economize on notation and write Eq. 3.1-6a as $\{Y \le y\} = \{X \in C_y\}$ in 
the sequel. For $C_y$ so determined it follows that 

$$P[Y \le y] = P[X \in C_y] \tag{3.1-6b}$$

since the underlying event is the same. If $C_y$ is empty, then the probability of $\{Y \le y\}$ is 
zero. 

In dealing with the input-output model, it is generally convenient to omit any references 
to an abstract underlying experiment and deal, instead, directly with the RVs X and Y. 
In this approach the underlying experiments are the observations on X, events are Borel 
subsets of the real line $R^1$, and the set function P[·] is replaced by the distribution function 
$F_X(\cdot)$. Then Y is a mapping (an RV) whose domain is the range $R_X$ of X, and whose range 
$R_Y$ is a subset of $R^1$. The functional properties of X are ignored in favor of viewing X 
as a mechanism that gives rise to numerically valued random phenomena. In this view the 
domain of X is irrelevant. 

Additional discussion on the various views of an FRV is available in the literature.† 


3.2 SOLVING PROBLEMS OF THE TYPE Y = g(X) 


We shall now demonstrate how to solve problems of the type Y = g(X). Eventually we shall 
develop a formula that will enable us to solve problems of this type very rapidly. However, 
use of the formula at too early a stage of the development will tend to mask the underlying 
principles needed to deal with more difficult problems. 


Example 3.2-1 
Let X be a uniform RV on (0,1), that is, X: U(0,1), and let Y = 2X + 3. Then we need to 
find the point set $C_y$ in Equation 3.1-6b to compute $F_Y(y)$. Clearly 

$$\{Y \le y\} = \{2X + 3 \le y\} = \{X \le \tfrac{1}{2}(y - 3)\}.$$

Hence $C_y$ is the interval $(-\infty, \frac{1}{2}(y - 3)]$ and 

$$F_Y(y) = F_X\left(\frac{y - 3}{2}\right).$$

†For example see Davenport [3-1, p.174] or Papoulis and Pillai [3-5]. 

The pdf of Y is 

$$f_Y(y) = \frac{dF_Y(y)}{dy} = \frac{d}{dy}\left[F_X\left(\frac{y - 3}{2}\right)\right] = \frac{1}{2} f_X\left(\frac{y - 3}{2}\right).$$

The solution is shown in Figure 3.2-1. 

Figure 3.2-1 (a) Original pdf of X; (b) the pdf of Y = 2X + 3. 


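Generalization. Let Y = aX + b with X a continuous RV with pdf $f_X(x)$. Then for a > 0 
the outcomes $\{\zeta\} \subset \Omega$ that produce the event $\{aX + b \le y\}$ are identical with the outcomes 
$\{\zeta\} \subset \Omega$ that produce the event $\{X \le \frac{y - b}{a}\}$. Thus, 

$$\{Y \le y\} = \{aX + b \le y\} = \left\{X \le \frac{y - b}{a}\right\}.$$

From the definition of the CDF: 

$$F_Y(y) = F_X\left(\frac{y - b}{a}\right), \tag{3.2-1}$$

and so 

$$f_Y(y) = \frac{1}{a} f_X\left(\frac{y - b}{a}\right). \tag{3.2-2}$$

For a < 0, the following events are equal† 

$$\{Y \le y\} = \{aX + b \le y\} = \left\{X \ge \frac{y - b}{a}\right\}.$$

†By which we mean that the event $\{\zeta: Y(\zeta) \le y\} = \{\zeta: X(\zeta) \ge \frac{y - b}{a}\}$. 

Since the events $\{X < \frac{y - b}{a}\}$ and $\{X \ge \frac{y - b}{a}\}$ are disjoint and their union is the certain 
event, we obtain from Axiom 3 

$$P\left[X < \frac{y - b}{a}\right] + P\left[X \ge \frac{y - b}{a}\right] = 1.$$

Finally for a continuous RV 

$$P\left[X < \frac{y - b}{a}\right] = P\left[X \le \frac{y - b}{a}\right]$$

and 

$$P\left[X \ge \frac{y - b}{a}\right] = F_Y(y).$$

Thus, for a < 0 

$$F_Y(y) = 1 - F_X\left(\frac{y - b}{a}\right) \tag{3.2-3}$$

and 

$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y - b}{a}\right), \quad a \ne 0. \tag{3.2-4}$$

When X is not necessarily continuous, we would have to modify the development in the 
case a < 0 because it may no longer be true that $P[X < \frac{y - b}{a}] = P[X \le \frac{y - b}{a}]$ because of 
the possibility that the event $\{X = \frac{y - b}{a}\}$ has a positive probability. The modified statement 
then becomes $P[X < \frac{y - b}{a}] = P[X \le \frac{y - b}{a}] - P[X = \frac{y - b}{a}] = F_X(\frac{y - b}{a}) - P_X(\frac{y - b}{a})$, where 
we have employed the PMF $P_X$ to subtract the probability of this event. The final answer 
for the case a < 0 must be changed accordingly. 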
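
Equation 3.2-4 translates directly into code. The sketch below is a minimal illustration for a continuous X; the names `affine_pdf` and `uniform01_pdf` are assumptions introduced here, and the uniform density is the one from Example 3.2-1.

```python
def affine_pdf(pdf_x, a, b):
    """Return the pdf of Y = a*X + b for continuous X with pdf pdf_x (Equation 3.2-4)."""
    if a == 0:
        raise ValueError("a must be nonzero; Y = b is not a continuous RV")
    return lambda y: pdf_x((y - b) / a) / abs(a)

def uniform01_pdf(x):
    """pdf of X: U(0,1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

pdf_pos = affine_pdf(uniform01_pdf, 2.0, 3.0)   # Y = 2X + 3, supported on (3, 5)
pdf_neg = affine_pdf(uniform01_pdf, -2.0, 3.0)  # Y = -2X + 3, supported on (1, 3)
```

Both transformed densities have height 1/2 on an interval of length 2, so each still integrates to one; only the location of the support depends on the sign of a.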

Example 3.2-2 
(square-law detector) Let X be an RV with continuous CDF $F_X(x)$ and let $Y \triangleq X^2$. Then 

$$\{Y \le y\} = \{X^2 \le y\} = \{-\sqrt{y} \le X \le \sqrt{y}\} = \{-\sqrt{y} < X \le \sqrt{y}\} \cup \{X = -\sqrt{y}\}. \tag{3.2-5}$$

The probability of the union of disjoint events is the sum of their probabilities. Using the 
definition of the CDF, we obtain 

$$F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) + P[X = -\sqrt{y}]. \tag{3.2-6}$$

If X is continuous, $P[X = -\sqrt{y}] = 0$. Then for y > 0, 

$$f_Y(y) = \frac{d}{dy}[F_Y(y)] = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}). \tag{3.2-7}$$

For y < 0, $f_Y(y) = 0$. How do we know this? Recall from Equation 3.1-6b that if $C_y$ is 
empty, then $P[X \in C_y] = 0$ and hence $f_Y(y) = 0$. For y < 0, there are no values of the RV 
X on the real line that satisfy 

$$\{-\sqrt{y} \le X \le \sqrt{y}\}.$$

Hence $f_Y(y) = 0$ for y < 0. If X: N(0,1), then from Equation 3.2-7, 

$$f_Y(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{1}{2}y}\, u(y), \tag{3.2-8}$$

where u(y) is the standard unit step function. Equation 3.2-8 is the Chi-square pdf with 
one degree-of-freedom. 
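
A quick numerical cross-check of Equations 3.2-6 through 3.2-8 is sketched below for X: N(0,1). All function names are assumptions made for this illustration; the check compares Equation 3.2-7 with the closed form 3.2-8 and with a finite-difference slope of the CDF in Equation 3.2-6.

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def square_law_cdf(cdf_x, y):
    """F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) for continuous X (Equation 3.2-6)."""
    if y < 0:
        return 0.0
    r = math.sqrt(y)
    return cdf_x(r) - cdf_x(-r)

def square_law_pdf(pdf_x, y):
    """f_Y(y) from Equation 3.2-7; zero for y <= 0."""
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return (pdf_x(r) + pdf_x(-r)) / (2.0 * r)

def chi_square_one_pdf(y):
    """Closed form of Equation 3.2-8."""
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y) if y > 0 else 0.0

y0, h = 1.7, 1e-5
# Central-difference estimate of dF_Y/dy at y0.
slope = (square_law_cdf(std_normal_cdf, y0 + h)
         - square_law_cdf(std_normal_cdf, y0 - h)) / (2.0 * h)
```

The three quantities `square_law_pdf(std_normal_pdf, y0)`, `chi_square_one_pdf(y0)`, and `slope` should agree to within numerical error.
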
Example 3.2-3 


(half-wave rectifier) A half-wave rectifier has the transfer characteristic g(x) = x u(x) 
(Figure 3.2-2). 

Figure 3.2-2 Half-wave rectifier. 

Thus, 

$$F_Y(y) = P[Xu(X) \le y] = \int_{\{x:\, xu(x) \le y\}} f_X(x)\,dx. \tag{3.2-9}$$

(i) Let y > 0; then $\{x: xu(x) \le y\} = \{x: x > 0;\ x \le y\} \cup \{x: x \le 0\} = \{x: x \le y\}$. 
Thus $F_Y(y) = \int_{-\infty}^{y} f_X(x)\,dx = F_X(y)$. 
(ii) Next let y = 0. Then $P[Y = 0] = P[X \le 0] = F_X(0)$. 
(iii) Finally let y < 0. Then $\{x: xu(x) \le y\} = \phi$ (the empty set). Thus, 

$$F_Y(y) = \int_{\phi} f_X(x)\,dx = 0.$$

If X: N(0,1), then $F_Y(y)$ has the form in Figure 3.2-3. 
The pdf is obtained by differentiation. Because of the discontinuity at y = 0, we obtain 
a Dirac impulse in the pdf at y = 0, that is, 

$$f_Y(y) = \begin{cases} 0, & y < 0, \\ F_X(0)\,\delta(y), & y = 0, \\ f_X(y), & y > 0. \end{cases} \tag{3.2-10}$$

Figure 3.2-3 The CDF of Y when X: N(0,1) for the half-wave rectifier. 

This can be compactly written as $f_Y(y) = f_X(y)u(y) + F_X(0)\delta(y)$. We note that in this 
problem it is not true that $P[Y < 0] = P[Y \le 0]$, because the event {Y = 0} has nonzero 
probability. 

Example 3.2-4 
Let X be a Bernoulli RV with P[X = 0] = p and P[X = 1] = q. Then 

$$f_X(x) = p\,\delta(x) + q\,\delta(x - 1) \quad \text{and} \quad F_X(x) = p\,u(x) + q\,u(x - 1),$$

where u(x) is the unit-step function of continuous variable x. 
Let $Y \triangleq X - 1$. Then (Figure 3.2-4) 

$$F_Y(y) = P[X - 1 \le y] = P[X \le y + 1] = F_X(y + 1) = p\,u(y + 1) + q\,u(y).$$

The pdf is 

$$f_Y(y) = \frac{d}{dy}[F_Y(y)] = p\,\delta(y + 1) + q\,\delta(y). \tag{3.2-11}$$

Figure 3.2-4 (a) CDF of X; (b) CDF of Y = X − 1. 


Example 3.2-5 
(transformation of CDFs) Let X have a continuous CDF $F_X(x)$ that is a strict monotone 
increasing function† of x. Let Y be an RV formed from X by the transformation with the 
CDF function itself, 

$$Y \triangleq F_X(X). \tag{3.2-12}$$

To compute $F_Y(y)$, we proceed as usual: 

$$\{Y \le y\} = \{F_X(X) \le y\} = \{X \le F_X^{-1}(y)\}.$$

Hence 

$$F_Y(y) = P[F_X(X) \le y] = \int_{\{x:\, F_X(x) \le y\}} f_X(x)\,dx.$$

1. Let y < 0. Then since $0 \le F_X(x) \le 1$ for all $x \in [-\infty, \infty]$, the set $\{x: F_X(x) \le 
y\} = \phi$ and $F_Y(y) = 0$. 

2. Let y > 1. Then $\{x: F_X(x) \le y\} = [-\infty, \infty]$ and $F_Y(y) = 1$. 

3. Let $0 \le y \le 1$. Then $\{x: F_X(x) \le y\} = \{x: x \le F_X^{-1}(y)\}$ and 

$$F_Y(y) = \int_{-\infty}^{F_X^{-1}(y)} f_X(x)\,dx = F_X(F_X^{-1}(y)) = y.$$

Hence 

$$F_Y(y) = \begin{cases} 0, & y < 0, \\ y, & 0 \le y \le 1, \\ 1, & y > 1. \end{cases} \tag{3.2-13}$$

Equation 3.2-13 says that whatever probability law X obeys, so long as it is continuous and 
strictly monotonic, $Y = F_X(X)$ will be a uniform RV. Conversely, given a uniform distribution 
for Y, the transformation $X \triangleq F_X^{-1}(Y)$ will generate an RV with continuous and strictly 
monotonic CDF $F_X(x)$ (Figure 3.2-5). This technique is sometimes used in simulation to 
generate RVs with specified distributions from a uniform RV. 
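
The simulation technique mentioned at the end of this example can be sketched in a few lines of Python. The exponential law is chosen here only because its CDF has a simple closed-form inverse; the function names, the rate value, and the seed are assumptions made for illustration and do not come from the text.

```python
import math
import random

def exponential_cdf(x, lam):
    """F_X(x) = 1 - exp(-lam x) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

def exponential_inverse_cdf(u, lam):
    """Inverse of the exponential CDF on 0 <= u < 1."""
    return -math.log(1.0 - u) / lam

def sample_exponential(n, lam, seed=0):
    """Draw n samples by passing uniform RVs through the inverse CDF."""
    rng = random.Random(seed)
    return [exponential_inverse_cdf(rng.random(), lam) for _ in range(n)]

samples = sample_exponential(20_000, lam=2.0)
sample_mean = sum(samples) / len(samples)  # should be near 1/lam = 0.5
```

Applying `exponential_cdf` to the samples would map them back to uniform values, which is the forward direction of Equation 3.2-12.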


Example 3.2-6 
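(transform uniform to standard Normal) From the last example, we can transform a uniform
RV $X$: $U[0,1]$ to any continuous distribution that has a strictly increasing CDF. If we want
a standard Normal, that is, Gaussian $Y$: $N(0,1)$, its CDF is given as

$$F_{SN}(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} \exp\left(-\tfrac{1}{2}v^2\right) dv,$$

so the required transformation is $y = g(x) = F_{SN}^{-1}(x)$.

†In other words, $x_2 > x_1$ implies $F_X(x_2) > F_X(x_1)$, that is, without the possibility of equality.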
Figure 3.2-5 Generating an RV with CDF $F_X(x)$ from a uniform RV. (a) Creating a uniform RV $Y$
from an RV $X$ with CDF $F_X(x)$; (b) creating an RV with CDF $F_X(x)$ from a uniform RV $Y$.


A plot of this transformation is given in Figure 3.2-6. 

A MATLAB program transformCDF.m available at the book website can be used to 
generate relative frequency histograms of this transformation in action. The following results 
were obtained with 1000 trials. Figure 3.2-7 shows the histogram of the 1000 RVs distributed 
as U[0, 1]. Figure 3.2-8b shows the corresponding histogram of the transformed RVs. 
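
The book's transformCDF.m is not reproduced here. The following is an independent Python sketch of the same kind of experiment, assuming the standard-library `statistics.NormalDist.inv_cdf` as the inverse of the standard Normal CDF; the helper names and bin choices are illustrative only.

```python
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(1)
std_normal = NormalDist(0.0, 1.0)

def uniform_to_normal(x):
    """Map a U(0,1) value through the inverse standard Normal CDF."""
    return std_normal.inv_cdf(x)

# Draw 1000 uniform values, skipping an exact 0.0 (inv_cdf needs 0 < x < 1).
xs = []
while len(xs) < 1000:
    x = rng.random()
    if 0.0 < x < 1.0:
        xs.append(x)
ys = [uniform_to_normal(x) for x in xs]

def histogram(values, lo, hi, bins):
    """Relative-frequency histogram on [lo, hi) with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

hist_x = histogram(xs, 0.0, 1.0, 10)    # roughly flat
hist_y = histogram(ys, -4.0, 4.0, 16)   # roughly bell shaped
print(fmean(ys), pstdev(ys))
```

With 1000 trials the sample mean and standard deviation of the transformed values should be near 0 and 1, respectively.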


Figure 3.2-6 Plot of transformation $y = g(x) = F_{SN}^{-1}(x)$ that transforms $U[0, 1]$ into $N(0, 1)$.

Figure 3.2-7 Histogram of 1000 i.i.d. RVs distributed as $U[0, 1]$.


Example 3.2-7
(quantizing) In analog-to-digital conversion, an analog waveform is sampled, quantized, and
coded (Figure 3.2-9). A quantizer is a function that assigns to each sample $x$ a value from a
set $Q = \{y_{-N}, \ldots, y_0, \ldots, y_N\}$ of $2N + 1$ predetermined values [3-2]. Thus, an uncountably
infinite set of values (the analog input $x$) is reduced to a finite set (some digital output $y_i$).
Note that this practical quantizer is also a limiter, that is, for $x$ greater than some $y_N$ or
less than some $y_{-N}$, the output is $y = y_N$ or $y_{-N}$, respectively.

A common quantizer is the uniform quantizer, which is a staircase function of uniform
step size $a$, that is,

$$g(x) = ia, \qquad (i - 1)a < x \le ia, \quad i \text{ an integer}. \tag{3.2-14}$$

Thus, the quantizer assigns to each $x$ the closest value of $ia$ above the continuous sample value
$x$, as is shown by the staircase function in Figure 3.2-10.
[Figure 3.2-8: Histogram of Y]

Figure 3.2-9 An analog-to-digital converter.

Figure 3.2-10 Quantizer output (staircase function) versus input (continuous line).

Figure 3.2-11 $F_X(y)$ versus $F_Y(y)$.


If $X$ is an RV denoting the sampled value of the input and $Y$ denotes the quantizer
output, then with $a = 1$, we get the output PMF

$$P_Y(i) = P[i - 1 < X \le i] = F_X(i) - F_X(i - 1).$$

The output CDF then becomes the staircase function

$$F_Y(y) = \sum_{i=-\infty}^{\infty} P_Y(i)\,u(y - i) = \sum_{i=-\infty}^{\infty} [F_X(i) - F_X(i - 1)]\,u(y - i), \tag{3.2-15}$$

as sketched in Figure 3.2-11.
When $y = n$ (an integer), $F_Y(n) = F_X(n)$; otherwise $F_Y(y) < F_X(y)$.
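
As a rough numerical illustration of the staircase CDF, the sketch below assumes a standard Normal input (the text leaves the law of $X$ general) and truncates the sum over $i$ at $-40$, where the Normal CDF is numerically zero. The function names are hypothetical.

```python
import math
from statistics import NormalDist

X = NormalDist(0.0, 1.0)  # assumed input law, for illustration only

def quantize(x):
    """Uniform quantizer with a = 1: g(x) = i for i - 1 < x <= i."""
    return math.ceil(x)

def pmf_Y(i):
    """P_Y(i) = F_X(i) - F_X(i - 1)."""
    return X.cdf(i) - X.cdf(i - 1)

def cdf_Y(y):
    """Staircase CDF: add the masses at all integers i <= y."""
    return sum(pmf_Y(i) for i in range(-40, math.floor(y) + 1))

print(cdf_Y(1), X.cdf(1))      # equal at an integer
print(cdf_Y(0.5), X.cdf(0.5))  # staircase lies below F_X between integers
```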


Example 3.2-8 


(sine wave) A classic problem is to determine the pdf of $Y = \sin X$, where $X$: $U(-\pi, +\pi)$,
that is, uniformly distributed over $(-\pi, +\pi)$. From Figure 3.2-12 we see that for $0 \le y < 1$,
the event $\{Y \le y\}$ satisfies

$$\{Y \le y\} = \{\sin X \le y\} = \{-\pi < X \le \sin^{-1} y\} \cup \{\pi - \sin^{-1} y < X \le \pi\}.$$

Since the two events on the last line are disjoint, we obtain

$$F_Y(y) = F_X(\pi) - F_X(\pi - \sin^{-1} y) + F_X(\sin^{-1} y) - F_X(-\pi). \tag{3.2-16}$$

Hence

$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(\pi - \sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} + f_X(\sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} \tag{3.2-17}$$

$$= \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}} + \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}} \tag{3.2-18}$$

$$= \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \qquad 0 \le y < 1. \tag{3.2-19}$$


If this calculation is repeated for $-1 < y < 0$, the diagram in Figure 3.2-12 changes to
that of Figure 3.2-13. So the event $\{Y \le y\} = \{\sin X \le y\}$ expressed in terms of the RV $X$
becomes

$$\{-\pi - \sin^{-1} y < X \le \sin^{-1} y\}, \tag{3.2-20}$$

where we are now using the inverse sin appropriate for $y < 0$. Then we can write the
following equation for the CDFs

$$F_Y(y) = F_X(\sin^{-1} y) - F_X(-\pi - \sin^{-1} y).$$


Figure 3.2-12 Graph showing roots of $y = \sin x$ when $0 \le y < 1$.

Figure 3.2-13 Plot showing roots of $y = \sin x$ when $-1 < y < 0$.

Figure 3.2-14 The probability density function of $Y = \sin X$.


Upon differentiation, we obtain 


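$$f_Y(y) = f_X(\sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} + f_X(-\pi - \sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} = \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \qquad -1 < y < 0,$$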


which is the same form as before when 0 < y < 1. 

Finally we consider $|y| > 1$, and since $|\sin x| \le 1$ for all $x$, we see that the pdf
$f_Y$ must be zero there. Combining these three results, we obtain the complete solution
(Figure 3.2-14):

$$f_Y(y) = \begin{cases} \dfrac{1}{\pi}\dfrac{1}{\sqrt{1 - y^2}}, & |y| < 1, \\ 0, & \text{otherwise.} \end{cases} \tag{3.2-21}$$
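
A quick Monte Carlo check of this result can be sketched as follows. Integrating the pdf of Equation 3.2-21 gives the CDF $F_Y(y) = 1/2 + (\sin^{-1} y)/\pi$ on $[-1, 1]$, which is compared here with an empirical CDF. The seed, sample size, and function names are arbitrary choices for this illustration.

```python
import math
import random

def cdf_Y(y):
    """CDF implied by Equation 3.2-21 (integral of the pdf from -1 to y)."""
    if y <= -1.0:
        return 0.0
    if y >= 1.0:
        return 1.0
    return 0.5 + math.asin(y) / math.pi

rng = random.Random(2)
n = 200_000
ys = [math.sin(rng.uniform(-math.pi, math.pi)) for _ in range(n)]

def ecdf(y):
    """Fraction of simulated values of sin X that are <= y."""
    return sum(v <= y for v in ys) / n

for y in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(y, ecdf(y), cdf_Y(y))
```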


We shall now go on to derive a simple formula that will enable us to solve many problems 
of the type Y = g(X) by going directly from pdf to pdf, without the need to find the CDF 
first. We shall call this new approach the direct method. For some problems, however, the 
indirect method of this past section may be less prone to error. 


General Formula of Determining the pdf of Y = g(X) 


We are given the continuous RV $X$ with pdf $f_X(x)$ and the differentiable function $g(x)$ of
the real variable $x$. What is the pdf of $Y \triangleq g(X)$?
Solution The event $\{y < Y \le y + dy\}$ can be written as a union of disjoint elementary
events $\{E_i\}$ in the Borel field generated under $X$. If the equation $y = g(x)$ has a finite
number $n$ of real roots† $x_1, \ldots, x_n$, then the disjoint events have the form $E_i = \{x_i - |dx_i| < X \le x_i\}$
if $g'(x_i)$ is negative or $E_i = \{x_i < X \le x_i + |dx_i|\}$ if $g'(x_i)$ is positive.‡
(See Figure 3.2-15.) In either case, it follows from the definition of the pdf that $P[E_i] = f_X(x_i)|dx_i|$. Hence

$$P[y < Y \le y + dy] = f_Y(y)|dy| = \sum_{i=1}^{n} f_X(x_i)|dx_i| \tag{3.2-22}$$

or, equivalently, if we divide through by $|dy|$

$$f_Y(y) = \sum_{i=1}^{n} f_X(x_i)\left|\frac{dx_i}{dy}\right| = \sum_{i=1}^{n} f_X(x_i)\left|\frac{dy}{dx_i}\right|^{-1}.$$

At the roots of $y = g(x)$, $dy/dx_i = g'(x_i)$, and we obtain the important formula

$$f_Y(y) = \sum_{i=1}^{n} f_X(x_i)/|g'(x_i)|, \qquad x_i = x_i(y), \quad g'(x_i) \ne 0. \tag{3.2-23}$$


Equation 3.2-23 is a fundamental equation that is very useful in solving problems where
the transformation $g(x)$ has several roots. Note that we need to make the assumption that
$g'(x_i) \ne 0$ at all the roots. To see what happens otherwise, realize that a region where
$g' = 0$ is a flat region for the transformation $g$. So for any $x$ in this flat region, the $y$ value
is identical, and that will create a probability mass at this value of $y$ whose amount is equal
to the probability of the event that $X$ falls in this flat region. In terms of the pdf $f_Y$, the
mass would turn into an impulse with area equal to the mass.

If, for a given $y$, the equation $y - g(x) = 0$ has no real roots, then $f_Y = 0$ at that $y$.†
Figure 3.2-15 illustrates the case when $n = 2$.

Figure 3.2-15 The event $\{y < Y \le y + dy\}$ is the union of two disjoint events on the probability
space of $X$.

†By roots we mean the set of points $x_i$ such that $y - g(x_i) = 0$, $i = 1, \ldots, n$.
‡The prime indicates derivatives with respect to $x$.
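
Equation 3.2-23 translates almost line for line into code. The sketch below is an illustration only: it assumes $X$: $N(0,1)$ and the hypothetical transformation $g(x) = x^2$ (two roots $\pm\sqrt{y}$ for $y > 0$), and checks the direct method against a numerical derivative of the CDF obtained by the indirect method.

```python
import math
from statistics import NormalDist

X = NormalDist(0.0, 1.0)  # assumed law for X, for illustration only

def pdf_direct(y, roots, g_prime, pdf_x):
    """Equation 3.2-23: sum f_X(x_i) / |g'(x_i)| over the real roots x_i(y).
    An empty root list gives 0, matching the no-real-roots case."""
    return sum(pdf_x(x) / abs(g_prime(x)) for x in roots(y))

def roots_square(y):
    """Real roots of y - x**2 = 0 (hypothetical example transformation)."""
    return [math.sqrt(y), -math.sqrt(y)] if y > 0 else []

def f_Y(y):
    return pdf_direct(y, roots_square, lambda x: 2.0 * x, X.pdf)

def F_Y(y):
    """Indirect method for the same g: P[-sqrt(y) <= X <= sqrt(y)]."""
    return X.cdf(math.sqrt(y)) - X.cdf(-math.sqrt(y)) if y > 0 else 0.0

h = 1e-5
for y in (0.5, 1.0, 2.5):
    numeric = (F_Y(y + h) - F_Y(y - h)) / (2.0 * h)
    print(y, f_Y(y), numeric)
```

Passing the roots and the derivative as functions keeps `pdf_direct` independent of any particular $g$; only the root finder changes from problem to problem.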


Example 3.2-9 
(trig function of X) To illustrate the use of Equation 3.2-23, we solve Example 3.2-8 by 
using this formula. Thus we seek the pdf of $Y = \sin X$ when the pdf of $X$ is $f_X(x) = 1/2\pi$
for $-\pi < x \le \pi$. Here the function $g$ is $g(x) = \sin x$. The two roots of $y - g(x) = y - \sin x = 0$
for $y > 0$ are $x_1 = \sin^{-1} y$, $x_2 = \pi - \sin^{-1} y$. Also

$$\frac{dg}{dx} = \cos x,$$

which must be evaluated at the two roots $x_1$ and $x_2$. At $x_1 = \sin^{-1} y$ we get $dg/dx|_{x=x_1} = \cos(\sin^{-1} y)$.
Likewise when $x_2 = \pi - \sin^{-1} y$, we get

$$\left.\frac{dg}{dx}\right|_{x=x_2} = \cos(\pi - \sin^{-1} y) = \cos\pi\,\cos(\sin^{-1} y) + \sin\pi\,\sin(\sin^{-1} y) = -\cos(\sin^{-1} y).$$

The quantity $\cos(\sin^{-1} y)$ can be further evaluated with the help of Figure 3.2-16. There
we see that $\theta = \sin^{-1} y$ and $\cos\theta = \sqrt{1 - y^2} = \cos(\sin^{-1} y)$. Hence

$$\left|\frac{dg}{dx}\right|_{x_1} = \left|\frac{dg}{dx}\right|_{x_2} = \sqrt{1 - y^2}.$$

Finally, $f_X(\sin^{-1} y) = f_X(\pi - \sin^{-1} y) = 1/2\pi$. Using these results in Equation 3.2-23
enables us to write

$$f_Y(y) = \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \qquad 0 \le y < 1,$$

which is the same result as in Equation 3.2-19. Repeating this procedure for $y < 0$ then
gives the same solution for all $y$ as is given in Equation 3.2-21.

Figure 3.2-16 Evaluating $\cos(\sin^{-1} y)$.

†The RV $X$, being real valued, cannot take on values that are imaginary.
Figure 3.2-17 Roots of $g(x) = x^n - y = 0$ when $n$ is odd.


Example 3.2-10 
(nonlinear devices) A number of nonlinear zero-memory devices can be modeled by a transmission
function $g(x) = x^n$. Let $Y = X^n$. The pdf of $Y$ depends on whether $n$ is even or
odd. We solve the case of $n$ odd, leaving $n$ even as an exercise. For $n$ odd and $y > 0$, the
only real root to $y - x^n = 0$ is $x_1 = y^{1/n}$. Also

$$\frac{dg}{dx} = n x^{n-1} = n y^{(n-1)/n}.$$

For $y < 0$, the only real root is $x_1 = -|y|^{1/n}$. See Figure 3.2-17. Also

$$\frac{dg}{dx} = n x^{n-1} = n |y|^{(n-1)/n}.$$

Hence

$$f_Y(y) = \begin{cases} \dfrac{1}{n}\, y^{(1-n)/n}\, f_X(y^{1/n}), & y \ge 0, \\ \dfrac{1}{n}\, |y|^{(1-n)/n}\, f_X(-|y|^{1/n}), & y < 0. \end{cases}$$

In problems in which $g(x)$ assumes a constant value, say $g(x) = c$, over some nonzero-width
interval, Equation 3.2-23 cannot be used to compute $f_Y(y)$ because $g'(x) = 0$ over the
interval. One additionally has to find the probability mass generated by this flat section.


Example 3.2-11 
(linear amplifier with cutoff.) Consider a nonlinear device with transformation as shown in 
Figure 3.2-18. 

The function $g(x)$ is given by

$$g(x) = 0, \qquad |x| \ge 1 \tag{3.2-24}$$

$$g(x) = x, \qquad -1 < x < 1. \tag{3.2-25}$$

Figure 3.2-18 A linear amplifier with cutoff.

Thus $g'(x) = 0$ for $|x| > 1$, and $g'(x) = 1$ for $-1 < x < 1$. For $y \ge 1$ and $y \le -1$, there
are no real roots to $y - g(x) = 0$. Hence $f_Y(y) = 0$ in this range. For $-1 < y < 1$, the only
root to $y - g(x) = y - x = 0$ is $x = y$. Hence in this range Equation 3.2-23 applies with
$|g'(x)| = 1$ and so $f_Y(y) = f_X(y)$. We note that $P[Y = 0] = P[X \ge 1] + P[X \le -1]$. If
$X$: $N(0,1)$, $P[X \ge 1] = 1/2 - \operatorname{erf}(1) = P[X \le -1]$ and so $P[Y = 0] = 1 - 2\operatorname{erf}(1) = 0.317$.
We would like to incorporate the result that $P[Y = 0] = 0.317$ into the pdf of $Y$. We can
do this with the aid of delta functions, realizing that

$$P[Y = 0] = 0.317 = \lim_{\varepsilon \to 0} \int_{0-\varepsilon}^{0+\varepsilon} 0.317\,\delta(y)\,dy.$$

Hence by including the term $0.317\,\delta(y)$ in $f_Y(y)$ we obtain the complete solution as:

$$f_Y(y) = \begin{cases} 0, & |y| \ge 1, \\ (2\pi)^{-1/2}\exp\left(-\tfrac{1}{2}y^2\right) + 0.317\,\delta(y), & -1 < y < 1. \end{cases}$$
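
The probability mass at $y = 0$ can be checked numerically. Note that the erf used in this example satisfies $P[X \ge 1] = 1/2 - \operatorname{erf}(1)$, so it is not the same function as Python's `math.erf`; the sketch below therefore works with the standard Normal CDF from `statistics.NormalDist` instead. The simulation part, including the seed and sample size, is an added illustration.

```python
import random
from statistics import NormalDist

X = NormalDist(0.0, 1.0)

def g(x):
    """Linear amplifier with cutoff: g(x) = x on (-1, 1), else 0."""
    return x if -1.0 < x < 1.0 else 0.0

# P[Y = 0] = P[X >= 1] + P[X <= -1]
mass_at_zero = (1.0 - X.cdf(1.0)) + X.cdf(-1.0)

rng = random.Random(3)
n = 200_000
hits = sum(g(rng.gauss(0.0, 1.0)) == 0.0 for _ in range(n))
empirical_mass = hits / n
print(mass_at_zero, empirical_mass)
```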


Example 3.2-12 
(infinite roots) Here we consider the periodic extension of the transformation shown in 
Figure 3.2-18. The extended $g(x)$ is shown in Figure 3.2-19.

The function in this case is described by

$$g(x) = \sum_{n=-\infty}^{\infty} (x - 2n)\,\operatorname{rect}\left(\frac{x - 2n}{2}\right).$$

Here rect is the symmetric unit-pulse function defined as

$$\operatorname{rect}(x) \triangleq \begin{cases} 1, & -0.5 < x < +0.5, \\ 0, & \text{else.} \end{cases}$$

Figure 3.2-19 Periodic transformation function.

As in the previous example $f_Y(y) = 0$ for $|y| \ge 1$ because there are no real roots to
the equation $y - g(x) = 0$ in this range. On the other hand, when $-1 < y < 1$, there
are an infinite number of roots to $y - g(x) = 0$ and these are given by $x_n = y + 2n$ for
$n = \ldots, -2, -1, 0, 1, 2, \ldots$. At each root $|g'(x_n)| = 1$. Hence, from Equation 3.2-23 we
obtain $f_Y(y) = \sum_{n=-\infty}^{\infty} f_X(y + 2n)\,\operatorname{rect}(y/2)$. In the case that $X$: $N(0,1)$ this specializes to

$$f_Y(y) = (2\pi)^{-1/2} \sum_{n=-\infty}^{\infty} \exp\left(-\tfrac{1}{2}(y + 2n)^2\right) \times \operatorname{rect}\left(\frac{y}{2}\right).$$

While this result is correct, it seems hard to believe that the sum of infinitely many positive terms
yields a function whose area is restricted to one. To show that $f_Y(y)$ does indeed integrate
to one, we proceed as follows:

$$\int_{-\infty}^{\infty} f_Y(y)\,dy = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} \int_{-1}^{1} \exp\left(-\tfrac{1}{2}(y + 2n)^2\right) dy \tag{3.2-26}$$

$$= \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} \int_{-1+2n}^{1+2n} \exp\left(-\tfrac{1}{2}y^2\right) dy \tag{3.2-27}$$

$$= \sum_{n=-\infty}^{\infty} [\operatorname{erf}(1 + 2n) - \operatorname{erf}(-1 + 2n)]. \tag{3.2-28}$$

If this last sum is written out, the reader will quickly find that all the terms cancel except
the first ($n = -\infty$) and the last ($n = \infty$). This leaves

$$\int_{-\infty}^{\infty} f_Y(y)\,dy = \operatorname{erf}(\infty) - \operatorname{erf}(-\infty) = 2 \times \operatorname{erf}(\infty) = 1.$$
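
The same conclusion can be reached numerically. The sketch below truncates the infinite series at $|n| \le 30$ (an assumption made here; the omitted terms are far below floating-point precision) and integrates the result over $(-1, 1)$ with a simple midpoint rule.

```python
import math

def f_Y(y, n_max=30):
    """Truncated series for f_Y(y) when X: N(0,1); zero outside (-1, 1)."""
    if not -1.0 < y < 1.0:
        return 0.0
    total = sum(math.exp(-0.5 * (y + 2 * n) ** 2)
                for n in range(-n_max, n_max + 1))
    return total / math.sqrt(2.0 * math.pi)

def midpoint_integral(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

area = midpoint_integral(f_Y, -1.0, 1.0)
print(area)  # should be very close to 1
```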


3.3 SOLVING PROBLEMS OF THE TYPE Z = g(X, Y) 


In many problems in science and engineering, a random variable Z is functionally related 
to two (or more) random variables X,Y. Some examples are 


1. The signal Z at the input of an amplifier consists of a signal X to which is added 
independent random noise Y. Thus Z = X + Y. If X is also an RV, what is the 
pdf of Z? (See Figure 3.3-1.) 
Figure 3.3-2 Displacement in the random-walk problem.


2. A two-engine airplane is capable of flight as long as at least one of its two engines 
is working. If the time-to-failures of the starboard and port engines are X and Y, 
respectively, then the time-to-crash of the airplane is Z = max(X,Y). What is the 
pdf of Z? 

3. Many signal processing systems multiply two signals together (modulators, demodulators,
correlators, and so forth). If $X$ is the signal on one input and $Y$ is the signal
on the other input, what is the pdf of the output $Z \triangleq XY$?

4. In the famous “random-walk” problem that applies to a number of important physical
problems, a particle undergoes random independent displacements $X$ and $Y$
in the $x$ and $y$ directions, respectively. What is the pdf of the total displacement
$Z \triangleq [X^2 + Y^2]^{1/2}$? (See Figure 3.3-2.)

Problems of the type Z = g(X,Y) are not fundamentally different from the type of 
problem we discussed in Section 3.2. Recall that for Y = g(X) the basic problem was to 
find the point set $C_y$ such that the events $\{\zeta: Y(\zeta) \le y\}$ and $\{\zeta: X(\zeta) \in C_y\}$ were equal.
Essentially, the same problem occurs here as well: Find the point set $C_z$ in the $(x, y)$ plane
such that the events $\{\zeta: Z(\zeta) \le z\}$ and $\{\zeta: (X(\zeta), Y(\zeta)) \in C_z\}$ are equal, this being indicated
in our usual shorthand notation by

$$\{Z \le z\} = \{(X, Y) \in C_z\} \tag{3.3-1}$$

and

$$F_Z(z) = \iint_{(x,y) \in C_z} f_{XY}(x, y)\,dx\,dy. \tag{3.3-2}$$
The point set $C_z$ is determined from the functional relation $g(x, y) \le z$. Clearly in problems
of the type $Z = g(X, Y)$ we deal with joint densities or distributions and double integrals
(or summations) instead of single ones. Thus, in general, the computation of $f_Z(z)$ is
more complicated than the computation of $f_Y(y)$ in $Y = g(X)$. However, we have access
to two great labor-saving devices, which we shall learn about later: (1) We can solve
many $Z = g(X, Y)$-type problems by a “turn-the-crank” type formula, essentially an
extension of Equation 3.2-23, through the use of auxiliary variables (Section 3.4); and
(2) we can solve problems of the type $Z = X + Y$ through the use of characteristic functions
(Chapter 4). However, use of these shortcut methods at this stage would obscure the
underlying principles.
Let us now solve the problems mentioned earlier from first principles.


Example 3.3-1 
(product of RVs) To find $C_z$ in Equation 3.3-2 for the CDF of $Z = XY$, we need to determine
the region where $g(x, y) \triangleq xy \le z$. This region is shown in Figure 3.3-3 for $z > 0$.

Thus, reasoning from the diagram, we compute

$$F_Z(z) = \int_{0}^{\infty} \left(\int_{-\infty}^{z/y} f_{XY}(x, y)\,dx\right) dy + \int_{-\infty}^{0} \left(\int_{z/y}^{\infty} f_{XY}(x, y)\,dx\right) dy \quad \text{for } z \ge 0. \tag{3.3-3}$$

To compute the density $f_Z$, it is necessary to differentiate this expression with respect to $z$.
We can do this directly on Equation 3.3-3; however, to see this more clearly we first define
the indefinite integral $G_{XY}(x, y)$ by

$$G_{XY}(x, y) \triangleq \int f_{XY}(x, y)\,dx. \tag{3.3-4}$$

Figure 3.3-3 The region $xy \le z$ for $z > 0$.
Then

$$F_Z(z) = \int_{0}^{\infty} [G_{XY}(z/y, y) - G_{XY}(-\infty, y)]\,dy + \int_{-\infty}^{0} [G_{XY}(\infty, y) - G_{XY}(z/y, y)]\,dy$$

and differentiation with respect to $z$ is fairly simple now to get

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{XY}(z/y, y)\,dy. \tag{3.3-5}$$

We could have gotten the same answer by directly differentiating Equation 3.3-3 with respect
to $z$ using formula A2-1 of Appendix A.

The question remains as to what is the answer when $z < 0$. It turns out that Equation 3.3-5
is valid for $z < 0$ as well, so that it is valid for all $z$. The corresponding sketch in the
case when $z < 0$ is shown in Figure 3.3-4. From this figure, performing the integration
over the new shaded region corresponding to $\{xy \le z\}$ now in the case $z < 0$, you should
get the same integral expression for $F_Z(z)$ as above, that is, Equation 3.3-3. Taking the
derivative with respect to $z$ and moving it inside the integral over $y$, we then again obtain
Equation 3.3-5. Thus, the general pdf for the product of two random variables for any value
of $z$ is confirmed to be

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{XY}(z/y, y)\,dy, \qquad -\infty < z < +\infty. \tag{3.3-6}$$


As a special case, assume $X$ and $Y$ are independent, identically distributed (i.i.d.) RVs
with

$$f_X(x) = f_Y(x) = \frac{\alpha/\pi}{x^2 + \alpha^2}.$$

This is known as the Cauchy† probability law. Because of independence

$$f_{XY}(x, y) = f_X(x) f_Y(y)$$

and because of the evenness of the integrand in Equation 3.3-6, we obtain,‡ after a change
of variable,

$$f_Z(z) = \left(\frac{\alpha}{\pi}\right)^2 \int_{0}^{\infty} \frac{1}{z^2 + \alpha^2 x} \cdot \frac{1}{\alpha^2 + x}\,dx = \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4} \ln\frac{z^2}{\alpha^4}. \tag{3.3-7}$$

See Figure 3.3-5 for a sketch of $f_Z(z)$ for $\alpha = 1$.

Figure 3.3-4 The region $xy \le z$ for $z < 0$.

Figure 3.3-5 The pdf $f_Z(z)$ of $Z = XY$ when $X$ and $Y$ are i.i.d. RVs and Cauchy.


Example 3.3-2 
(maximum operation) We wish to compute the pdf of Z = max(X,Y) if X and Y are 
independent RVs. Then 


$$F_Z(z) = P[\max(X, Y) \le z].$$

But the event $\{\max(X, Y) \le z\}$ is equal to $\{X \le z, Y \le z\}$. Hence

$$P[Z \le z] = P[X \le z, Y \le z] = F_X(z) F_Y(z) \tag{3.3-8}$$

and by differentiation, we get

$$f_Z(z) = f_Y(z) F_X(z) + f_X(z) F_Y(z). \tag{3.3-9}$$

Again as a special case, let $f_X(x) = f_Y(x)$ be the uniform $[0, 1]$ pdf. Then

$$f_Z(z) = 2z[u(z) - u(z - 1)], \tag{3.3-10}$$

which is plotted in Figure 3.3-6. The computation of $Z = \min(X, Y)$ is left as an end-of-chapter
problem.


†Augustin-Louis Cauchy (1789-1857). French mathematician who wrote copiously on astronomy, optics,
hydrodynamics, function theory, and the like.

‡See B. O. Peirce and R. M. Foster, A Short Table of Integrals, 4th ed. (Boston, MA: Ginn & Company,
1956), p. 8.
Figure 3.3-6 The pdf of $Z = \max(X, Y)$ for $X$, $Y$ i.i.d. and uniform in $[0, 1]$.

Figure 3.3-7 The pdf of the maximum of two independent exponential random variables.


Example 3.3-3 
(max of exponentials) Let $X$, $Y$ be i.i.d. RVs with exponential pdf $f_X(x) = e^{-x}u(x)$. Let
$Z = \max(X, Y)$. Compute $f_Z(z)$ and then determine the probability $P[Z \le 1]$.
Solution From $P[Z \le z] = P[X \le z, Y \le z] = P[X \le z]P[Y \le z]$, we obtain

$$F_Z(z) = F_X(z) F_Y(z) = (1 - e^{-z})^2 u(z)$$

and

$$f_Z(z) = \frac{dF_Z(z)}{dz} = 2e^{-z}(1 - e^{-z})u(z).$$

The pdf is shown in Figure 3.3-7. Finally, $F_Z(1) = (1 - e^{-1})^2 u(1) \approx 0.4$.
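
As a sanity check on this example, the closed-form CDF can be compared with a simulation. The sketch below is illustrative only; the seed and sample size are arbitrary, and `random.expovariate(1.0)` is used to draw the unit-rate exponential RVs.

```python
import math
import random

def F_Z(z):
    """CDF of Z = max(X, Y) for i.i.d. unit-rate exponentials."""
    return (1.0 - math.exp(-z)) ** 2 if z > 0 else 0.0

def f_Z(z):
    """pdf of Z, the derivative of F_Z."""
    return 2.0 * math.exp(-z) * (1.0 - math.exp(-z)) if z > 0 else 0.0

rng = random.Random(4)
n = 200_000
zs = [max(rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(n)]
p_hat = sum(z <= 1.0 for z in zs) / n   # estimate of P[Z <= 1]
print(F_Z(1.0), p_hat)
```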




The sum of two independent random variables. The situation modeled by $Z = X + Y$
(and its extension $Z = \sum_{i=1}^{N} X_i$) occurs so frequently in engineering and science that the
computation of $f_Z(z)$ is perhaps the most important of all problems of the type
$Z = g(X, Y)$.

As in other problems of this type, we must find the set of points $C_z$ such that the
event $\{Z \le z\}$ that, by definition, is equal to the event $\{X + Y \le z\}$ is also equal to
$\{(X, Y) \in C_z\}$. The set of points $C_z$ is the set of all points such that $g(x, y) \triangleq x + y \le z$
and therefore represents the shaded region to the left of the line in Figure 3.3-8; any point
in the shaded region satisfies $x + y \le z$.

Using Equation 3.3-2, specialized for this case, we obtain

$$F_Z(z) = \iint_{x + y \le z} f_{XY}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \left(\int_{-\infty}^{z - y} f_{XY}(x, y)\,dx\right) dy = \int_{-\infty}^{\infty} [G_{XY}(z - y, y) - G_{XY}(-\infty, y)]\,dy, \tag{3.3-11}$$

where $G_{XY}(x, y)$ is the indefinite integral

$$G_{XY}(x, y) \triangleq \int f_{XY}(x, y)\,dx. \tag{3.3-12}$$

Figure 3.3-8 The region $C_z$ (shaded) for computing the pdf of $Z \triangleq X + Y$.




The pdf is obtained by differentiation of $F_Z(z)$. Thus,

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{\infty} \frac{d}{dz}[G_{XY}(z - y, y)]\,dy = \int_{-\infty}^{\infty} f_{XY}(z - y, y)\,dy. \tag{3.3-13}$$

Equation 3.3-13 is an important result (compare with Equation 3.3-6 for $Z = XY$). In
many instances $X$ and $Y$ are independent RVs so that $f_{XY}(x, y) = f_X(x) f_Y(y)$. Then
Equation 3.3-13 takes the special form

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\,dy, \tag{3.3-14}$$

which is known as the convolution integral or, more specifically, the convolution of $f_X$ with
$f_Y$.† It is a simple matter to prove that Equation 3.3-14 can be rewritten as

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\,dx, \tag{3.3-15}$$

by use of the transformation of variables $x = z - y$ in Equation 3.3-14.


Example 3.3-4 
(addition of RVs) Let $X$ and $Y$ be independent RVs with $f_X(x) = e^{-x}u(x)$ and
$f_Y(y) = \frac{1}{2}[u(y + 1) - u(y - 1)]$ and let $Z \triangleq X + Y$. What is the pdf of $Z$?


Solution A big help in solving convolution-type problems is to keep track of what is
going on graphically. Thus, Figure 3.3-9(a) shows $f_X(y)$ and $f_Y(y)$; Figure 3.3-9(b)
shows $f_X(z - y)$. Note that $f_X(z - y)$ is the reversed and shifted image of $f_X(y)$. How do
we know that the point at the leading edge of the reverse/shifted image is $y = z$? Consider

$$f_X(z - y) = e^{-(z - y)}u(z - y).$$

But $u(z - y) = 0$ for $y > z$. Therefore the reverse/shifted function is nonzero for $(-\infty, z]$
and the leading edge of $f_X(z - y)$ is at $y = z$.
Since fx and fy are discontinuous functions, we do not expect fz(z) to be described by 
the same expression for all values of z. This means that we must do a careful step-by-step 
evaluation of Equation 3.3-14 for different regions of z-values. 


Figure 3.3-9 (a) The pdf's $f_X(y)$, $f_Y(y)$; (b) the reverse/shifted pdf $f_X(z - y)$.

†A common notation for the convolution integral as in Equation 3.3-15 is $f_Z = f_X * f_Y$.

(a) Region 1. $z < -1$. For $z < -1$ the situation is as shown in Figure 3.3-10(a). Since
there is no overlap, Equation 3.3-14 yields zero. Thus $f_Z(z) = 0$ for $z < -1$.

(b) Region 2. $-1 \le z < 1$. In this region the situation is as in Figure 3.3-10(b). Thus
Equation 3.3-14 yields

$$f_Z(z) = \frac{1}{2}\int_{-1}^{z} e^{-(z - y)}\,dy = \frac{1}{2}[1 - e^{-(z + 1)}].$$

(c) Region 3. $z \ge 1$. In this region the situation is as in Figure 3.3-10(c). From
Equation 3.3-14 we obtain

$$f_Z(z) = \frac{1}{2}\int_{-1}^{1} e^{-(z - y)}\,dy = \frac{1}{2}[e^{-(z - 1)} - e^{-(z + 1)}].$$

Before collecting these results to form a graph we make one final important observation:
Since no delta functions were involved in the computation, $f_Z(z)$ must be a continuous
function of $z$. Hence, as a check on the solution, the $f_Z(z)$ values at the boundaries of the
regions must match. For example, at the junction $z = 1$ between region 2 and region 3

$$\left.\frac{1}{2}[1 - e^{-(z + 1)}]\right|_{z=1} \overset{?}{=} \left.\frac{1}{2}[e^{-(z - 1)} - e^{-(z + 1)}]\right|_{z=1}.$$

Obviously the right and left sides of this equation agree so we have some confidence in
our solution (Figure 3.3-11).
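
The piecewise result can also be checked against a direct numerical evaluation of the convolution integral, Equation 3.3-14. The sketch below uses a plain midpoint rule; the step count and the test points are arbitrary choices made for this illustration.

```python
import math

def f_X(x):
    """Exponential pdf e^{-x} u(x)."""
    return math.exp(-x) if x >= 0 else 0.0

def f_Y(y):
    """Uniform pdf on [-1, 1]."""
    return 0.5 if -1.0 <= y <= 1.0 else 0.0

def convolve_at(z, steps=20_000):
    """Midpoint-rule value of the integral of f_X(z - y) f_Y(y) over y.
    f_Y vanishes outside [-1, 1], so that interval suffices."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        y = -1.0 + (k + 0.5) * h
        total += f_X(z - y) * f_Y(y)
    return total * h

def f_Z_closed(z):
    """Piecewise closed form from regions 1 to 3 of the example."""
    if z < -1.0:
        return 0.0
    if z < 1.0:
        return 0.5 * (1.0 - math.exp(-(z + 1.0)))
    return 0.5 * (math.exp(-(z - 1.0)) - math.exp(-(z + 1.0)))

for z in (-1.5, -0.5, 0.0, 0.5, 1.0, 2.5):
    print(z, convolve_at(z), f_Z_closed(z))
```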


Equations 3.3-14 and 3.3-15 can easily be extended to computing the pdf of $Z = aX + bY$.
To be specific, let $a > 0$, $b > 0$. Then the region $g(x, y) \triangleq ax + by \le z$ is to the left of the
line $y = z/b - ax/b$ (Figure 3.3-12).
Figure 3.3-10 Relative positions of $f_X(z - y)$ and $f_Y(y)$ for (a) $z < -1$; (b) $-1 \le z < 1$; (c) $z \ge 1$.

Figure 3.3-11 The pdf $f_Z(z)$ from Example 3.3-4.


Hence

$$F_Z(z) = \iint_{g(x, y) \le z} f_{XY}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} f_Y(y) \left(\int_{-\infty}^{z/a - by/a} f_X(x)\,dx\right) dy. \tag{3.3-16}$$
Figure 3.3-12 The region of integration for computing the pdf of $Z \triangleq aX + bY$ shown for $a > 0$,
$b > 0$.


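As usual, to obtain $f_Z(z)$ we differentiate with respect to $z$; this furnishes

$$f_Z(z) = \frac{1}{a}\int_{-\infty}^{\infty} f_X\left(\frac{z}{a} - \frac{by}{a}\right) f_Y(y)\,dy, \tag{3.3-17}$$

where we assumed that $X$ and $Y$ are independent RVs. Equivalently, we can compute $f_Z(z)$
by writing

$$V \triangleq aX, \qquad W \triangleq bY, \qquad Z \triangleq V + W.$$

Then again, assuming $a > 0$, $b > 0$ and $X$, $Y$ independent, we obtain from Equation 3.3-14

$$f_Z(z) = \int_{-\infty}^{\infty} f_V(z - w) f_W(w)\,dw$$

where, from Equation 3.2-2,

$$f_V(v) = \frac{1}{a} f_X\left(\frac{v}{a}\right)$$

and

$$f_W(w) = \frac{1}{b} f_Y\left(\frac{w}{b}\right).$$

Thus,

$$f_Z(z) = \frac{1}{ab}\int_{-\infty}^{\infty} f_X\left(\frac{z - w}{a}\right) f_Y\left(\frac{w}{b}\right) dw. \tag{3.3-18}$$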
Although Equation 3.3-18 doesn’t “look” like Equation 3.3-17, in fact it is identical to it. We
need only make the change of variable $y \triangleq w/b$ in Equation 3.3-18 to obtain Equation 3.3-17.


Example 3.3-5 
(a game of Sic bo) In many jurisdictions in the United States, the taxes and fees from legal 
gambling parlors are used to finance public education, build roads, etc. Gambling parlors 
operate to make a profit and set the odds to their advantage. In the popular game of Sic 
bo, the player bets on the outcome of a simultaneous throw of three dice. Many bets are 
possible, each with a different payoff. Events that are more likely have a smaller payoff, 
while events that are less likely have a larger payoff. At one large gambling parlor the set 
odds are the following: 


Sum of three dice equals 4 or 17 (60 to 1);
Sum of three dice equals 5 or 16 (30 to 1);
Sum of three dice equals 6 or 15 (17 to 1);
Sum of three dice equals 7 or 14 (12 to 1);
Sum of three dice equals 8 or 13 (8 to 1);
Sum of three dice equals 9 or 10 or 11 or 12 (6 to 1).


For example, 60 to 1 odds means that if the player bets one dollar and the event 
occurs, he/she gets 60 dollars back minus the dollar ante. It is of interest to calculate the 
probabilities of the various events. 


Solution All the outcomes involve the sum of three i.i.d. random variables. Let $X$, $Y$, $Z$
denote the numbers that show up on the three dice, respectively. We can compute the
result we need by two successive convolutions. Thus, for the sum on the faces of two
dice, the PMF of $X + Y$ is $P_{X+Y}(l) = \sum_{i} P_X(l - i) P_Y(i)$, and the result is shown in
Figure 3.3-13. To compute the PMF of the sum of all three RVs, $P_{X+Y+Z}(n)$, we perform
a second convolution as $P_{X+Y+Z}(n) = \sum_{i} P_Z(n - i) P_{X+Y}(i)$. The result is shown in
Figure 3.3-14. From the second convolution, we obtain the probabilities of the events of
interest.

Figure 3.3-13 Probabilities of getting a sum on the faces of two dice.

Figure 3.3-14 Probabilities of getting a sum on the faces of three dice.

We define a “fair payout” (FP) as the return from the house that, on the average, yields
no loss or gain to the bettor.† If E is the event the bettor bets on, and the ante is $1.00,
then for an FP the return should be 0, so 0 = −$1.00 + FP × P[E]. So FP = 1/P[E].

†Obviously the house needs to make enough to cover its expenses, for example, salaries, utilities, etc.
The definition of a “fair payout” here ignores these niceties. Also the notion of average will be explored in
some detail in Chapter 4.

We read the results directly from Figure 3.3-14 to obtain the following:

1. Getting a sum of 4 or 17 (you can bet on either but not both) has a win probability
of 3/216 or a fair payout of 72:1 (compare with 60:1).
2. Getting a sum of 5 or 16 (you can bet on either but not both) has a win probability
of 6/216 or a fair payout of 36:1 (compare with 30:1).
3. Getting a sum of 6 or 15 (you can bet on either but not both) has a win probability
of 10/216 or a fair payout of 22:1 (compare with 17:1).
4. Getting a sum of 7 or 14 (you can bet on either but not both) has a win probability
of 15/216 or a fair payout of 14:1 (compare with 12:1).
5. Getting a sum of 8 or 13 (you can bet on either but not both) has a win probability
of 21/216 or a fair payout of 10:1 (compare with 8:1).
6. Getting a sum of 9 or 12 (you can bet on either but not both) has a win probability
of 25/216 or a fair payout of 9:1 (compare with 6:1).
7. Getting a sum of 10 or 11 (you can bet on either but not both) has a win probability
of 27/216 or a fair payout of 8:1 (compare with 6:1).
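
The two successive convolutions and the fair payouts listed above can be reproduced with exact rational arithmetic. This is a small illustrative sketch; the dictionary representation of a PMF and the helper name `convolve` are choices made here, not part of the text.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution of two PMFs stored as {value: probability}."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * qb
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)           # PMF of X + Y
three_dice = convolve(two_dice, die)    # PMF of X + Y + Z

for total in range(4, 11):
    prob = three_dice[total]
    # sum, win probability, fair payout FP = 1 / P[E]
    print(total, prob, float(1 / prob))
```

The printed fair payouts are the unrounded values (for example 21.6 for a sum of 6), which the list above rounds to the nearest integer.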


Example 3.3-6 
(square-law detector) Let $X$ and $Y$ be independent RVs, both distributed as $U(-1, 1)$.
Compute the pdf of $V \triangleq (X + Y)^2$.


Solution We solve this problem in two steps. First, we compute the pdf of $Z \triangleq X + Y$;
then we compute the pdf of $V = Z^2$. Using the pulse-width one rect function (defined in
Example 3.2-12), we have

$$f_X(x) = \frac{1}{2}\operatorname{rect}\left(\frac{x}{2}\right), \qquad f_Y(y) = \frac{1}{2}\operatorname{rect}\left(\frac{y}{2}\right).$$

From Equation 3.3-14 we get

$$f_Z(z) = \frac{1}{4}\int_{-\infty}^{\infty} \operatorname{rect}\left(\frac{z - y}{2}\right) \operatorname{rect}\left(\frac{y}{2}\right) dy. \tag{3.3-19}$$

The evaluation of Equation 3.3-19 is best done by keeping track graphically of where the
“moving,” that is, $z$-dependent function $\operatorname{rect}((z - y)/2)$, is centered vis-a-vis the “stationary,”
that is, $z$-independent function $\operatorname{rect}(y/2)$. The term moving is used because as $z$ is varied,
the function $f_X(z - y)$ has the appearance of moving past $f_Y(y)$. The situation for four
different values of $z$ is shown in Figure 3.3-15.
The evaluation of fz(z) for the four distinct regions is as follows: 


(a) $z < -2$. In this region there is no overlap so

$$f_Z(z) = 0.$$

(b) $-2 \le z < 0$. In this region there is overlap in the interval $(-1, z + 1)$ so

$$f_Z(z) = \frac{1}{4}\int_{-1}^{z + 1} dy = \frac{1}{4}(z + 2).$$

(c) $0 \le z < 2$. In this region there is overlap in the interval $(z - 1, 1)$ so

$$f_Z(z) = \frac{1}{4}\int_{z - 1}^{1} dy = \frac{1}{4}(2 - z).$$

(d) $2 \le z$. In this region there is no overlap so

$$f_Z(z) = 0.$$

If we put all these results together, we obtain

$$f_Z(z) = \frac{1}{4}(2 - |z|)\operatorname{rect}\left(\frac{z}{4}\right) \tag{3.3-20}$$

which is graphed in Figure 3.3-16.

Figure 3.3-15 Four distinct regions in the convolution of two uniform densities: (a) $z < -2$;
(b) $-2 \le z < 0$; (c) $0 \le z < 2$; (d) $z \ge 2$.
Figure 3.3-16 The pdf of $Z = X + Y$ for $X$, $Y$ i.i.d. RVs uniform on $(-1, 1)$.

Figure 3.3-17 The pdf of $V$ in Example 3.3-6.


To complete the solution to this problem, we still need the pdf of $V = Z^2$. We compute
$f_V(v)$ using Equation 3.2-23 with $g(z) = z^2$. For $v > 0$, the equation $v - z^2 = 0$ has two
real roots, that is, $z_1 = \sqrt{v}$, $z_2 = -\sqrt{v}$; for $v < 0$, there are no real roots. Hence, using
Equation 3.3-20 in

$$f_V(v) = \sum_{i=1}^{2} f_Z(z_i)/|g'(z_i)|$$

yields

$$f_V(v) = \begin{cases} \dfrac{1}{4}\left(\dfrac{2}{\sqrt{v}} - 1\right), & 0 < v \le 4, \\ 0, & \text{otherwise,} \end{cases} \tag{3.3-21}$$

which is shown in Figure 3.3-17.
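
A simulation offers an independent check on Equation 3.3-21. Integrating the pdf gives $F_V(v) = \sqrt{v} - v/4$ for $0 \le v \le 4$, which the sketch below compares with an empirical CDF; the seed, sample size, and helper names are arbitrary choices for this illustration.

```python
import math
import random

def f_V(v):
    """pdf of V from Equation 3.3-21."""
    return 0.25 * (2.0 / math.sqrt(v) - 1.0) if 0.0 < v <= 4.0 else 0.0

def F_V(v):
    """CDF obtained by integrating f_V: sqrt(v) - v/4 on [0, 4]."""
    if v <= 0.0:
        return 0.0
    if v >= 4.0:
        return 1.0
    return math.sqrt(v) - v / 4.0

rng = random.Random(5)
n = 200_000
vs = [(rng.uniform(-1.0, 1.0) + rng.uniform(-1.0, 1.0)) ** 2 for _ in range(n)]

def ecdf(v):
    """Fraction of simulated values of V that are <= v."""
    return sum(s <= v for s in vs) / n

for v in (0.25, 1.0, 2.25):
    print(v, ecdf(v), F_V(v))
```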


The PMF of the sum of discrete random variables can be computed by discrete convolution.
For instance, let $X$ and $Y$ be two RVs that take on values $x_1, \ldots, x_k, \ldots$ and
$y_1, \ldots, y_j, \ldots$, respectively. Then $Z \triangleq X + Y$ is obviously discrete as well and the PMF is
given by

$$P_Z(z_n) = \sum_{x_k + y_j = z_n} P_{X,Y}(x_k, y_j). \tag{3.3-22}$$

If $X$ and $Y$ are independent, Equation 3.3-22 becomes

$$P_Z(z_n) = \sum_{x_k + y_j = z_n} P_X(x_k) P_Y(y_j) = \sum_{x_k} P_X(x_k) P_Y(z_n - x_k). \tag{3.3-23a}$$

If the $z_n$’s and $x_k$’s are equally spaced† then Equation 3.3-23a is recognized as a discrete
convolution, in which case it can be written as

$$P_Z(n) = \sum_{\text{all } k} P_X(k) P_Y(n - k). \tag{3.3-23b}$$

An illustration of the use of Equation 3.3-23b is given below.


Example 3.3-7 
(sum of Bernoulli RVs) Let $B_1$ and $B_2$ be two independent Bernoulli RVs with common
PMF

$$P_B(k) = \begin{cases} p, & k = 1, \\ q, & k = 0, \\ 0, & \text{else,} \end{cases} \qquad \text{where } q \triangleq 1 - p.$$

Let $M \triangleq B_1 + B_2$ and find the PMF $P_M(m)$. We start with the general result

$$P_M(m) = \sum_{k=-\infty}^{\infty} P_{B_1}(k) P_{B_2}(m - k) = \sum_{k=0}^{1} P_{B_1}(k) P_{B_2}(m - k).$$

Since each $B_i$ can only take on values 0 and 1, the allowable values for $M$ are 0, 1, and 2.
For all other values of $m$, $P_M(m) = 0$. This can also be seen graphically from the discrete
convolution illustration in Figure 3.3-18.

Calculating the nonzero values of PMF $P_M$, we obtain

$$\begin{aligned} P_M(0) &= P_{B_1}(0) P_{B_2}(0) = q^2 \\ P_M(1) &= P_{B_1}(0) P_{B_2}(1) + P_{B_1}(1) P_{B_2}(0) = 2pq \\ P_M(2) &= P_{B_1}(1) P_{B_2}(1) = p^2. \end{aligned}$$

The student may notice that $M$ is distributed as binomial $b(k; 2, p)$. Why is this? What
would happen if we summed in another independent Bernoulli RV?
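
The closing question can be explored numerically. The sketch below implements Equation 3.3-23b for PMFs on $0, 1, 2, \ldots$ stored as lists, and compares the sum of three Bernoulli RVs with the binomial PMF $b(k; 3, p)$. The value $p = 0.3$ and the list representation are assumptions made for this illustration.

```python
from math import comb

def convolve(px, py):
    """Discrete convolution (Equation 3.3-23b) of PMFs on 0, 1, 2, ...;
    px[k] is P_X(k) and py[j] is P_Y(j)."""
    pz = [0.0] * (len(px) + len(py) - 1)
    for k, a in enumerate(px):
        for j, b in enumerate(py):
            pz[k + j] += a * b
    return pz

p = 0.3
q = 1.0 - p
bernoulli = [q, p]                      # P_B(0) = q, P_B(1) = p

pm = convolve(bernoulli, bernoulli)     # PMF of B1 + B2
pm3 = convolve(pm, bernoulli)           # PMF of B1 + B2 + B3
binom3 = [comb(3, k) * p**k * q**(3 - k) for k in range(4)]
print(pm, pm3, binom3)
```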


†For example, let $z_n = n\Delta$, $x_k = k\Delta$, $\Delta$ a constant.

Figure 3.3-18 Illustration of discrete convolution of two Bernoulli PMFs.


Example 3.3-8 
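(sum of Poisson RVs) Let X and Y be two independent Poisson RVs with PMFs
$P_X(k) = \frac{1}{k!} e^{-a} a^k$ and $P_Y(i) = \frac{1}{i!} e^{-b} b^i$, where a and b are the Poisson parameters for X and Y,
respectively. Let $Z \triangleq X + Y$. Then the PMF of Z, $P_Z(n)$, is given by

$$P_Z(n) = \sum_{k=0}^{n} P_X(k) P_Y(n - k) = \sum_{k=0}^{n} \frac{1}{k!\,(n-k)!}\, e^{-(a+b)} a^k b^{n-k}. \tag{3.3-24}$$

Recall the binomial theorem:

$$\sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = (a + b)^n. \tag{3.3-25}$$

Then

$$P_Z(n) = \frac{e^{-(a+b)}}{n!} \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = \frac{(a+b)^n}{n!}\, e^{-(a+b)}, \qquad n \ge 0, \tag{3.3-26}$$

which is the Poisson law with parameter a + b. Thus, we obtain the important result that the
sum of two independent Poisson RVs with parameters a, b is a Poisson RV with parameter
(a + b).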
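
The result can be confirmed numerically. The following is a minimal sketch under assumed parameter values a = 1.5 and b = 2.5; the function names are illustrative and not from the text. It evaluates the finite convolution sum of Equation 3.3-24 directly and compares it with the Poisson PMF with parameter a + b.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Poisson PMF e^{-rate} rate^k / k! for k = 0, 1, 2, ..."""
    return exp(-rate) * rate**k / factorial(k)


def pmf_of_sum(n, a, b):
    """P_Z(n) computed as the convolution sum over k = 0, ..., n."""
    return sum(poisson_pmf(k, a) * poisson_pmf(n - k, b) for k in range(n + 1))


a, b = 1.5, 2.5  # assumed Poisson parameters (illustrative)
max_gap = max(abs(pmf_of_sum(n, a, b) - poisson_pmf(n, a + b)) for n in range(20))
print("largest difference over n = 0..19:", max_gap)
```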
Example 3.3-9
(sum of binomial random variables) A more challenging example than the previous one
involves computing the sum of i.i.d. binomial RVs X and Y. Let Z = X + Y; then the
PMF $P_Z(m)$ is given by

$$P_Z(m) = \sum_{k=-\infty}^{\infty} P_X(k) P_Y(m - k),$$

where

$$P_X(k) = P_Y(k) = \begin{cases} 0, & k < 0, \\ \binom{n}{k} p^k q^{n-k}, & 0 \le k \le n, \\ 0, & k > n. \end{cases}$$

Thus,

$$\begin{aligned}
P_Z(m) &= \sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k} p^k q^{n-k} \binom{n}{m-k} p^{m-k} q^{n-(m-k)} \\
&= p^m q^{2n-m} \sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k} \binom{n}{m-k}.
\end{aligned}$$


The limits on this summation come from the need to enforce both $0 \le k \le n$ and $0 \le m - k \le n$,
the latter being equivalent to $m - n \le k \le m$. Hence the range of the summation must
be $\max(0, m - n) \le k \le \min(n, m)$ as indicated.


Somewhat amazingly

$$\sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k} \binom{n}{m-k} = \binom{2n}{m}, \tag{3.3-27}$$

so that we obtain the PMF of Z as

$$P_Z(m) = \binom{2n}{m} p^m q^{2n-m} \triangleq b(m; 2n, p). \tag{3.3-28}$$

Thus, the sum of two i.i.d. binomial RVs, each with PMF b(k; n, p), is a binomial RV with PMF
given as b(k; 2n, p).
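
Both the combinatorial identity of Equation 3.3-27 and the resulting PMF can be checked by brute force. The following is a minimal sketch with assumed values n = 5 and p = 0.4; the helper names are illustrative and not from the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """b(k; n, p), defined to be zero outside 0 <= k <= n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def identity_lhs(n, m):
    """Left-hand side of Equation 3.3-27, using the stated summation limits."""
    lo, hi = max(0, m - n), min(n, m)
    return sum(comb(n, k) * comb(n, m - k) for k in range(lo, hi + 1))


n, p = 5, 0.4  # assumed parameters (illustrative)

# Exact integer check of the identity for every admissible m.
identity_holds = all(identity_lhs(n, m) == comb(2 * n, m) for m in range(2 * n + 1))

# PMF of Z = X + Y by convolution, compared with b(m; 2n, p).
pmf_z = [
    sum(binomial_pmf(k, n, p) * binomial_pmf(m - k, n, p) for k in range(m + 1))
    for m in range(2 * n + 1)
]
max_gap = max(abs(pmf_z[m] - binomial_pmf(m, 2 * n, p)) for m in range(2 * n + 1))

print("identity holds for all m:", identity_holds)
print("largest PMF difference  :", max_gap)
```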

To show that Equation 3.3-27 is true we first notice that the left-hand side (LHS) has
the same value whether m > n (in which case the sum goes from k = m − n up to k = n)
or whether m ≤ n (in which case the sum goes from k = 0 up to k = m). A simple way to
see this is to expand out the LHS in both cases. Indeed an expansion of the LHS for m ≤ n
yields

$$\binom{n}{0}\binom{n}{m} + \binom{n}{1}\binom{n}{m-1} + \cdots + \binom{n}{m}\binom{n}{0}. \tag{3.3-29}$$

Doing the expansion in the case m > n yields the same sum.

Proceeding with the verification of Equation 3.3-27, note that the number of subpopulations
of size m that can be formed from a population of size 2n is $C_m^{2n}$. But another way
to form these subpopulations is to break the population of size 2n into two populations
of size n each. Call these two populations of size n each, A and B, respectively. Then the
product $C_k^n C_{m-k}^n$ is the number of ways of choosing k members from A and m − k
from B. Then clearly

$$\sum_{k=0}^{m} C_k^n C_{m-k}^n = C_m^{2n} \tag{3.3-30}$$

and since, as we said earlier,†

$$\sum_{k=0}^{m} C_k^n C_{m-k}^n = \sum_{k=m-n}^{n} C_k^n C_{m-k}^n,$$

the result in Equation 3.3-27 is equally valid when k goes from m − n to n.

In Chapter 4 we will find a simpler method for showing that the sum of i.i.d. binomial 
RVs is binomial. The method uses transformations called moment generating functions 
and/or characteristic functions. 


We mentioned earlier in Section 3.2 that although the formula in Equation 3.3-23a 
(and its extensions to be discussed in Section 3.4) is very handy for solving problems 
of this type, the indirect approach is sometimes easier. We illustrate with the following 
example. 


Example 3.3-10 


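(sum of squares) Let X and Y be i.i.d. RVs with $X: N(0, \sigma^2)$. What is the pdf of
$Z \triangleq X^2 + Y^2$?

Solution We begin with the fundamental result given in Equation 3.3-2:

$$\begin{aligned}
F_Z(z) &= \iint_{(x,y) \in C_z} f_{XY}(x, y)\,dx\,dy \qquad \text{for } z \ge 0 \\
&= \frac{1}{2\pi\sigma^2} \iint_{(x,y) \in C_z} e^{-(1/2\sigma^2)(x^2 + y^2)}\,dx\,dy.
\end{aligned} \tag{3.3-31}$$

The region $C_z$ consists of the shaded region in Figure 3.3-19.
Equation 3.3-31 is easily evaluated using polar coordinates. Let

$$x = r\cos\theta, \qquad y = r\sin\theta, \qquad dx\,dy \to r\,dr\,d\theta.$$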


†This formula can also be verified by using the change of variables $l \triangleq m - k$ in the RHS. The resulting
sum will run from large to small, but reversing the summation order does not affect a sum.

Figure 3.3-19 The region $C_z$ for the event $\{X^2 + Y^2 \le z\}$ for $z \ge 0$.


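Then $x^2 + y^2 \le z \rightarrow r \le \sqrt{z}$ and Equation 3.3-31 is transformed into

$$\begin{aligned}
F_Z(z) &= \frac{1}{2\pi\sigma^2} \int_0^{2\pi} d\theta \int_0^{\sqrt{z}} r \exp\left(-\frac{r^2}{2\sigma^2}\right) dr \\
&= \left[1 - e^{-z/2\sigma^2}\right] u(z)
\end{aligned} \tag{3.3-32}$$

and

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{1}{2\sigma^2}\, e^{-z/2\sigma^2}\, u(z). \tag{3.3-33}$$

Thus, $Z = X^2 + Y^2$ is an exponential RV if X and Y are i.i.d. zero-mean Gaussian.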
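
A quick simulation gives a sanity check on Equation 3.3-32. The following is a minimal sketch, assuming σ = 1.5, a fixed seed, and 20,000 samples, all of which are illustrative choices not taken from the text. It compares the empirical CDF of X² + Y² with the exponential CDF.

```python
import math
import random


def empirical_cdf_sum_of_squares(z, sigma, n_samples, rng):
    """Fraction of simulated pairs (X, Y) with X^2 + Y^2 <= z."""
    hits = 0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        if x * x + y * y <= z:
            hits += 1
    return hits / n_samples


def exponential_cdf(z, sigma):
    """F_Z(z) from Equation 3.3-32."""
    return 1.0 - math.exp(-z / (2.0 * sigma**2)) if z >= 0 else 0.0


rng = random.Random(12345)   # fixed seed so the run is repeatable
sigma = 1.5                  # assumed standard deviation (illustrative)
for z in (0.5, 2.0, 6.0):
    estimate = empirical_cdf_sum_of_squares(z, sigma, 20000, rng)
    print(f"z = {z}: simulated {estimate:.4f}, formula {exponential_cdf(z, sigma):.4f}")
```

The simulated and analytic values should agree to within the sampling error, which is roughly 0.004 for this sample size.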
Example 3.3-11 
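(square root of sum of squares) If the previous example is modified to finding the pdf of
$Z \triangleq (X^2 + Y^2)^{1/2}$, a radically different pdf results. Again we use Equation 3.3-2 except
that now $C_z$ consists of the shaded region in Figure 3.3-20.

Thus,

$$F_Z(z) = \frac{1}{2\pi\sigma^2} \int_0^{2\pi} d\theta \int_0^{z} r \exp\left(-\frac{r^2}{2\sigma^2}\right) dr \cdot u(z) = \left(1 - e^{-z^2/2\sigma^2}\right) u(z) \tag{3.3-34}$$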


Figure 3.3-20 The region $C_z$ for the event $\{(X^2 + Y^2)^{1/2} \le z\}$.

[Plot of $f_Z(z)$ versus z showing the Rayleigh and exponential curves; the Rayleigh peak is marked $0.606/\sigma$.]

Figure 3.3-21 Rayleigh and exponential pdf's.


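and

$$f_Z(z) = \frac{z}{\sigma^2}\, e^{-z^2/2\sigma^2}\, u(z), \tag{3.3-35}$$

which is the Rayleigh density function. For σ = 1 it is also known as the Chi distribution with
two degrees of freedom; its square, $Z^2$, then has the Chi-square distribution with two degrees
of freedom. The exponential and Rayleigh pdf's are compared in Figure 3.3-21.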


Stephen O. Rice [3-3], who in the 1940s did pioneering work in the analysis of electrical
noise, showed that narrow-band noise signals at center frequency ω can be represented by
the wave

$$Z(t) = X\cos\omega t + Y\sin\omega t, \tag{3.3-36}$$

where t is time, ω is the radian frequency in radians per second and where X and Y are
i.i.d. RVs distributed as $N(0, \sigma^2)$. The so-called envelope $Z \triangleq (X^2 + Y^2)^{1/2}$ has, therefore,
a Rayleigh distribution with parameter σ.

The next example generalizes the results of Example 3.3-11 and is a result of considerable
interest in communication theory.


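*Example 3.3-12†
(the Rician density)‡ S. O. Rice considered a version of the following problem: Let
$X: N(P, \sigma^2)$ and $Y: N(0, \sigma^2)$ be independent Gaussian RVs. What is the pdf of
$Z = (X^2 + Y^2)^{1/2}$? Note that with power parameter P = 0, we obtain the solution of Example
3.3-11.

We write

$$F_Z(z) = \begin{cases} \dfrac{1}{2\pi\sigma^2} \displaystyle\iint_{(x^2 + y^2)^{1/2} \le z} \exp\left(-\frac{1}{2}\left[\left(\frac{x - P}{\sigma}\right)^2 + \left(\frac{y}{\sigma}\right)^2\right]\right) dx\,dy, & z > 0, \\ 0, & z < 0. \end{cases} \tag{3.3-37}$$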


†Starred examples are somewhat more involved and can be omitted on a first reading.
‡Sometimes called the Rice-Nakagami pdf in recognition of the work of Nakagami around the time of
World War II.

The usual Cartesian-to-polar transformation $x = r\cos\theta$, $y = r\sin\theta$, $r = (x^2 + y^2)^{1/2}$,
$\theta = \tan^{-1}(y/x)$ yields

$$F_Z(z) = \frac{\exp\left[-\frac{1}{2}\left(\frac{P}{\sigma}\right)^2\right]}{2\pi\sigma^2} \int_0^z \exp\left[-\frac{1}{2}\left(\frac{r}{\sigma}\right)^2\right] \left(\int_0^{2\pi} e^{rP\cos\theta/\sigma^2}\,d\theta\right) r\,dr \cdot u(z). \tag{3.3-38}$$


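The function

$$I_0(x) \triangleq \frac{1}{2\pi} \int_0^{2\pi} e^{x\cos\theta}\,d\theta$$

is called the zero-order modified Bessel function of the first kind and is monotonically
increasing like $e^x$. With this notation, the cumbersome Equation 3.3-38 can be rewritten as

$$F_Z(z) = \frac{\exp\left[-\frac{1}{2}\left(\frac{P}{\sigma}\right)^2\right]}{\sigma^2} \int_0^z r\, I_0\!\left(\frac{rP}{\sigma^2}\right) e^{-\frac{1}{2}(r/\sigma)^2}\,dr \cdot u(z), \tag{3.3-39}$$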
where the step function u(z) ensures that the above is valid for all z. To obtain $f_Z(z)$ we
differentiate with respect to z. This produces

$$f_Z(z) = \frac{z}{\sigma^2} \exp\left[-\frac{1}{2}\left(\frac{z^2 + P^2}{\sigma^2}\right)\right] I_0\!\left(\frac{zP}{\sigma^2}\right) u(z). \tag{3.3-40}$$

The pdf given in Equation 3.3-40 is called the Rician probability density. Since $I_0(0) = 1$,
we obtain the Rayleigh law when P = 0. When $zP \gg \sigma^2$, that is, the argument of $I_0(\cdot)$ is
large, we use the approximation

$$I_0(x) \approx \frac{e^x}{(2\pi x)^{1/2}}$$

to obtain

$$f_Z(z) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \left(\frac{z}{P}\right)^{1/2} e^{-\frac{1}{2}[(z - P)/\sigma]^2},$$

which is almost Gaussian [except for the factor $(z/P)^{1/2}$]. This is the pdf of the envelope
of the sum of a strong sine wave and weak narrow-band Gaussian noise, a situation that
occurs not infrequently in electrical communications.
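
The Rician pdf and its two limiting forms can be explored numerically. The following is a minimal sketch using only the standard library: `bessel_i0` evaluates the defining integral of $I_0$ with a simple rectangle rule (an illustrative choice, adequate for a periodic integrand), and all function names and parameter values are assumptions rather than anything from the text.

```python
import math


def bessel_i0(x, steps=2000):
    """I_0(x) = (1/2pi) * integral over [0, 2pi] of exp(x cos(theta)), rectangle rule."""
    h = 2.0 * math.pi / steps
    total = sum(math.exp(x * math.cos(i * h)) for i in range(steps))
    return total * h / (2.0 * math.pi)


def rician_pdf(z, P, sigma):
    """Equation 3.3-40."""
    if z < 0:
        return 0.0
    return (z / sigma**2) * math.exp(-0.5 * (z * z + P * P) / sigma**2) * bessel_i0(z * P / sigma**2)


def rayleigh_pdf(z, sigma):
    """Equation 3.3-35."""
    return (z / sigma**2) * math.exp(-0.5 * (z / sigma) ** 2) if z >= 0 else 0.0


def rician_large_argument(z, P, sigma):
    """Near-Gaussian approximation, intended for z * P much larger than sigma^2."""
    return math.sqrt(z / P) / math.sqrt(2.0 * math.pi * sigma**2) * math.exp(-0.5 * ((z - P) / sigma) ** 2)


print("P = 0     :", rician_pdf(1.0, 0.0, 2.0), "vs Rayleigh", rayleigh_pdf(1.0, 2.0))
print("zP >> s^2 :", rician_pdf(10.0, 10.0, 1.0), "vs approx", rician_large_argument(10.0, 10.0, 1.0))
```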


3.4 SOLVING PROBLEMS OF THE TYPE V = g(X, Y), W = h(X,Y) 


The problem of two functions of two random variables is essentially an extension of the 
earlier cases except that the algebra is somewhat more involved. 


Fundamental Problem 


We are given two RVs X, Y with joint pdf $f_{XY}(x, y)$ and two differentiable functions
g(x, y) and h(x, y). Two new random variables are constructed according to V = g(X, Y),
W = h(X, Y). How do we compute the joint CDF $F_{VW}(v, w)$ (or joint pdf $f_{VW}(v, w)$) of
V and W?

Figure 3.4-1 A two-variable-to-two-variable matrixer.


Illustrations. 1. The transformation shown in Figure 3.4-1 occurs in communication
systems such as in the generation of stereo baseband systems [3-2]. The $\{a_i\}$ are gains.
When $a_1 = a_2 = \cos\theta$ and $a_3 = a_4 = \sin\theta$, the circuit is known as a θ-rotational
transformer. In another application if X and Y are used to represent, for example, the left and
right pick-up signals in stereo broadcasting, then V and W represent the difference and
sum signals if all the $a_i$'s are set to unity. The sum and difference signals are then used
to generate the signal to be transmitted. Suppose for the moment that there are no source
signals and that X and Y therefore represent only Gaussian noise. What is the pdf of V
and W?

2. The error in the landing location of a spacecraft from a prescribed point is denoted by
(X, Y) in Cartesian coordinates. We wish to specify the error in polar coordinates
$V \triangleq (X^2 + Y^2)^{1/2}$, $W \triangleq \tan^{-1}(Y/X)$. Given the joint pdf $f_{XY}(x, y)$ of landing error coordinates in
Cartesian coordinates, how do we compute the pdf of the landing error in polar coordinates?

The solution to the problem at hand is, as before, to find a point set $C_{vw}$ such that
the two events $\{V \le v, W \le w\}$ and $\{(X, Y) \in C_{vw}\}$ are equal.† Thus, the fundamental
relation is

$$P[V \le v, W \le w] \triangleq F_{VW}(v, w) = \iint_{(x,y) \in C_{vw}} f_{XY}(x, y)\,dx\,dy. \tag{3.4-1}$$

The region $C_{vw}$ is given by the points x, y that satisfy

$$C_{vw} = \{(x, y): g(x, y) \le v,\ h(x, y) \le w\}. \tag{3.4-2}$$

We illustrate the application of Equation 3.4-1 with an example.


Example 3.4-1 
(sum and difference) We are given $V \triangleq X + Y$ and $W \triangleq X - Y$ and wish to calculate the pdf
$f_{VW}(v, w)$. The point set $C_{vw}$ is described by the combined constraints $g(x, y) \triangleq x + y \le v$
and $h(x, y) \triangleq x - y \le w$; it is shown in Figure 3.4-2 for v > 0, w > 0.

†In more elaborate notation, we would write $\{\zeta: V(\zeta) \le v \text{ and } W(\zeta) \le w\} = \{\zeta: (X(\zeta), Y(\zeta)) \in C_{vw}\}$.

Figure 3.4-2 Point set $C_{vw}$ (shaded region) for Example 3.4-1.


The integration over the shaded region yields

$$F_{VW}(v, w) = \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx. \tag{3.4-3}$$

To obtain the joint density $f_{VW}(v, w)$, we use Equation 2.6-30. Hence

$$\begin{aligned}
f_{VW}(v, w) &= \frac{\partial^2 F_{VW}(v, w)}{\partial v\,\partial w} \\
&= \frac{\partial^2}{\partial v\,\partial w} \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx \\
&= \frac{\partial}{\partial v} \left[\frac{\partial}{\partial w} \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx\right] \\
&= \frac{\partial}{\partial v} \left[\frac{1}{2} \int_{(v-w)/2}^{(v-w)/2} f_{XY}\!\left(\frac{v+w}{2}, y\right) dy + \int_{-\infty}^{(v+w)/2} \left(\frac{\partial}{\partial w} \int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx\right] \\
&= \frac{\partial}{\partial v} \int_{-\infty}^{(v+w)/2} f_{XY}(x, x - w)\,dx \qquad \text{(because the first integral is zero for continuous RVs X and Y)} \\
&= \frac{1}{2} f_{XY}\!\left(\frac{v+w}{2}, \frac{v-w}{2}\right),
\end{aligned} \tag{3.4-4}$$

where use has been made of the general formula for differentiation of an integral (see
Appendix A.2.). Thus, even this simple problem, involving linear functions and just two
RVs, requires a considerable amount of work and care to obtain the joint pdf solution. For
this reason, problems of the type discussed in this section and their extensions to n RVs,
that is, $Y_1 = g_1(X_1, \ldots, X_n)$, $Y_2 = g_2(X_1, \ldots, X_n)$, ..., $Y_n = g_n(X_1, \ldots, X_n)$, are generally
solved by the technique discussed next called the direct method for joint pdf evaluation.
Essentially it is the two-dimensional extension of Equation 3.2-23.


Obtaining $f_{VW}$ Directly from $f_{XY}$


Instead of attempting to find $f_{VW}(v, w)$ through Equation 3.4-1, we can instead take a
different approach. Consider the elementary event

$$\{v < V \le v + dv,\ w < W \le w + dw\}$$

and the one-to-one† differentiable functions v = g(x, y), w = h(x, y). The inverse mappings
exist and are given by $x = \phi(v, w)$, $y = \psi(v, w)$. Later we shall consider the more general
case where, possibly, more than one pair of $(x_i, y_i)$ produce a given (v, w).

The probability $P[v < V \le v + dv, w < W \le w + dw]$ is the probability that V
and W lie in an infinitesimal rectangle of area dv dw with vertices at (v, w), (v + dv, w),
(v, w + dw), and (v + dv, w + dw). The image of this rectangle in the x, y coordinate system
is‡ an infinitesimal parallelogram with vertices at

$$\begin{aligned}
P_1 &= (x, y), \\
P_2 &= \left(x + \frac{\partial\phi}{\partial v}\,dv,\ y + \frac{\partial\psi}{\partial v}\,dv\right), \\
P_3 &= \left(x + \frac{\partial\phi}{\partial w}\,dw,\ y + \frac{\partial\psi}{\partial w}\,dw\right), \\
P_4 &= \left(x + \frac{\partial\phi}{\partial v}\,dv + \frac{\partial\phi}{\partial w}\,dw,\ y + \frac{\partial\psi}{\partial v}\,dv + \frac{\partial\psi}{\partial w}\,dw\right).
\end{aligned}$$


This mapping is shown in Figure 3.4-3. 

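With $\mathcal{R}$ denoting the rectangular region shown in Figure 3.4-3(a) and $\mathcal{S}$ denoting
the parallelogram in Figure 3.4-3(b) and $A(\mathcal{R})$ and $A(\mathcal{S})$ denoting the areas of $\mathcal{R}$ and
$\mathcal{S}$ respectively, we obtain

$$\begin{aligned}
P[v < V \le v + dv,\ w < W \le w + dw] &= \iint_{\mathcal{R}} f_{VW}(\xi, \eta)\,d\xi\,d\eta && (3.4\text{-}5) \\
&= f_{VW}(v, w)\,A(\mathcal{R}) && (3.4\text{-}6) \\
&= \iint_{\mathcal{S}} f_{XY}(\xi, \eta)\,d\xi\,d\eta && (3.4\text{-}7) \\
&= f_{XY}(x, y)\,A(\mathcal{S}). && (3.4\text{-}8)
\end{aligned}$$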

†Every point (x, y) maps into a unique (v, w) and vice versa.
‡See for example [3-4, p. 769].

[Panel (a): the rectangle bounded by the lines v = constant, v + dv = constant, w = constant, and w + dw = constant. Panel (b): its image in the x, y plane.]

Figure 3.4-3 An infinitesimal rectangle in the v, w system (a) maps into an infinitesimal parallelogram
(b) in the x, y system.


Equation 3.4-5 follows from the fundamental relation given in Equation 3.4-1; Equation
3.4-6 follows from the interpretation of the pdf given in Equation 2.4-6; Equation 3.4-7
follows by definition of the point set $\mathcal{S}$, that is, $\mathcal{S}$ is the set of points that makes the
events $\{(V, W) \in \mathcal{R}\}$ and $\{(X, Y) \in \mathcal{S}\}$ equal; and Equation 3.4-8 again follows from the
interpretation of pdf.

From Equations 3.4-6 and 3.4-8, we find that

$$f_{VW}(v, w) = \frac{A(\mathcal{S})}{A(\mathcal{R})}\, f_{XY}(x, y), \tag{3.4-9}$$

where $x = \phi(v, w)$ and $y = \psi(v, w)$.

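Essentially then, all that remains is to compute the ratio of the two areas. This is done in
Appendix C. There we show that the ratio $A(\mathcal{S})/A(\mathcal{R})$ is the magnitude of a quantity called
the Jacobian of the transformation $x = \phi(v, w)$, $y = \psi(v, w)$ and given the symbol $\tilde{J}$. If there
is more than one solution to the equations v = g(x, y), w = h(x, y), say, $x_1 = \phi_1(v, w)$,
$y_1 = \psi_1(v, w)$, $x_2 = \phi_2(v, w)$, $y_2 = \psi_2(v, w)$, ..., $x_n = \phi_n(v, w)$, $y_n = \psi_n(v, w)$, then $\mathcal{R}$
maps into multiple, disjoint infinitesimal regions $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n$ and $A(\mathcal{S}_i)/A(\mathcal{R}) = |\tilde{J}_i|$,
i = 1, ..., n. The $|\tilde{J}_i|$ are often written as the magnitude of determinants, that is,

$$|\tilde{J}_i| = \operatorname{mag} \begin{vmatrix} \partial\phi_i/\partial v & \partial\phi_i/\partial w \\ \partial\psi_i/\partial v & \partial\psi_i/\partial w \end{vmatrix} = \left|\frac{\partial\phi_i}{\partial v} \times \frac{\partial\psi_i}{\partial w} - \frac{\partial\psi_i}{\partial v} \times \frac{\partial\phi_i}{\partial w}\right|. \tag{3.4-10}$$

The end result is the important formula

$$f_{VW}(v, w) = \sum_{i=1}^{n} f_{XY}(x_i, y_i)\,|\tilde{J}_i|. \tag{3.4-11}$$

It is shown in Appendix C that $|\tilde{J}_i^{-1}| = |J_i| \triangleq |\partial g/\partial x_i \times \partial h/\partial y_i - \partial g/\partial y_i \times \partial h/\partial x_i|$. Then
we get the equally important formula

$$f_{VW}(v, w) = \sum_{i=1}^{n} f_{XY}(x_i, y_i)/|J_i|. \tag{3.4-12}$$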
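Example 3.4-2
(linear functions) We are given two functions

$$v \triangleq g(x, y) = 3x + 5y, \qquad w \triangleq h(x, y) = x + 2y \tag{3.4-13}$$

and the joint pdf $f_{XY}$ of two RVs X, Y. What is the joint pdf of two new random variables
V = g(X, Y), W = h(X, Y)?

Solution The inverse mappings are computed from Equation 3.4-13 to be

$$x = \phi(v, w) = 2v - 5w, \qquad y = \psi(v, w) = -v + 3w.$$

Then

$$\frac{\partial\phi}{\partial v} = 2, \qquad \frac{\partial\phi}{\partial w} = -5, \qquad \frac{\partial\psi}{\partial v} = -1, \qquad \frac{\partial\psi}{\partial w} = 3$$

and

$$|\tilde{J}| = \operatorname{mag} \begin{vmatrix} 2 & -5 \\ -1 & 3 \end{vmatrix} = 1.$$

Assume $f_{XY}(x, y) = (2\pi)^{-1} \exp[-\frac{1}{2}(x^2 + y^2)]$. Then, from Equation 3.4-11

$$\begin{aligned}
f_{VW}(v, w) &= \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left[(2v - 5w)^2 + (-v + 3w)^2\right]\right\} \\
&= \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left(5v^2 - 26vw + 34w^2\right)\right\}.
\end{aligned}$$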


Thus, the transformation converts uncorrelated Gaussian RVs into correlated Gaussian RVs. 
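
The steps of the direct method in this example can be traced in a few lines of code. The following is a minimal sketch (function names are illustrative, not from the text): it applies the inverse mapping, multiplies by the Jacobian magnitude, and compares the result with the closed-form quadratic in the exponent.

```python
import math


def f_xy(x, y):
    """Joint pdf of two independent N(0, 1) RVs, as assumed in the example."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)


def inverse_map(v, w):
    """x = phi(v, w), y = psi(v, w) for v = 3x + 5y, w = x + 2y."""
    return 2.0 * v - 5.0 * w, -v + 3.0 * w


def f_vw(v, w):
    """Direct method: f_XY at the root times the inverse-mapping Jacobian magnitude."""
    x, y = inverse_map(v, w)
    jacobian_magnitude = abs(2.0 * 3.0 - (-5.0) * (-1.0))  # equals 1
    return f_xy(x, y) * jacobian_magnitude


def f_vw_closed_form(v, w):
    """Closed form with the expanded quadratic 5v^2 - 26vw + 34w^2."""
    return math.exp(-0.5 * (5.0 * v * v - 26.0 * v * w + 34.0 * w * w)) / (2.0 * math.pi)


for v, w in [(0.0, 0.0), (0.3, -0.2), (1.0, 0.5)]:
    print((v, w), f_vw(v, w), f_vw_closed_form(v, w))
```

The nonzero cross term −26vw in the exponent is what makes V and W correlated.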


Example 3.4-3 
(two ordered random variables) Consider two i.i.d., continuous random variables with pdf's
$f_{X_1}(x) = f_{X_2}(x) = f_X(x)$. We define two new random variables as $Y_1 = \min(X_1, X_2)$ and
$Y_2 = \max(X_1, X_2)$. Clearly $Y_1 < Y_2$,† meaning that realizations of $Y_1$ are always less than
realizations of $Y_2$. We seek the joint pdf, $f_{Y_1Y_2}(y_1, y_2)$, of $Y_1$, $Y_2$ given that

$$Y_1 = g(X_1, X_2) = \min(X_1, X_2)$$
$$Y_2 = h(X_1, X_2) = \max(X_1, X_2).$$
Solution From Figure 3.4-4 (only the first quadrant is shown for convenience but all
four quadrants must be considered in any calculation), we see that there are two disjoint
regions and hence two solutions. We note that in $\mathcal{R}_1$, $x_1 > x_2$ while in $\mathcal{R}_2$,
$x_1 < x_2$. Thus, in $\mathcal{R}_1$ we have $y_1 = x_2$, $y_2 = x_1$; or, in the g, h notation, $y_1 = g_1(x_1, x_2) = x_2$,
$y_2 = h_1(x_1, x_2) = x_1$. The Jacobian magnitude of this transformation is unity so that
$f_{Y_1Y_2}(y_1, y_2) = f_{X_1X_2}(y_2, y_1) = f_X(y_2) f_X(y_1)$, $y_1 < y_2$.

†We ignore the zero-probability event $Y_1 = Y_2$.

Figure 3.4-4 Showing the two regions of interest for Example 3.4-3.

Repeating this analysis for $\mathcal{R}_2$, we have $y_1 = g_2(x_1, x_2) = x_1$, $y_2 = h_2(x_1, x_2) = x_2$ and
once again the Jacobian magnitude of this transformation is unity. Hence $f_{Y_1Y_2}(y_1, y_2) =
f_{X_1X_2}(y_1, y_2) = f_X(y_1) f_X(y_2)$, $y_1 < y_2$. As always we sum the solutions over the different
roots/regions (here there are two) and obtain

$$f_{Y_1Y_2}(y_1, y_2) = \begin{cases} 2 f_X(y_1) f_X(y_2), & -\infty < y_1 < y_2 < \infty, \\ 0, & \text{else.} \end{cases}$$


Question for the reader: We know that $X_1$ and $X_2$ are independent; are $Y_1$ and $Y_2$
independent?


Example 3.4-4 
(marginal probabilities of ordered random variables) In the previous example we ordered
two i.i.d. RVs $X_1, X_2$ as $Y_1, Y_2$, where $Y_1 < Y_2$. The joint pdf of $Y_1, Y_2$ was shown to be
$f_{Y_1Y_2}(y_1, y_2) = 2 f_X(y_1) f_X(y_2)$, $-\infty < y_1 < y_2 < \infty$. Here we wish to obtain the marginal
pdf's of $Y_1$, $Y_2$.

Solution
To get $f_{Y_1}(y_1)$ we have to integrate $f_{Y_1Y_2}(y_1, y_2) = 2 f_X(y_1) f_X(y_2)$ over all $y_2 > y_1$.
Hence

$$f_{Y_1}(y_1) = 2 f_X(y_1) \int_{y_1}^{\infty} f_X(y_2)\,dy_2 = 2 f_X(y_1)\left(1 - F_X(y_1)\right), \qquad -\infty < y_1 < \infty.$$

Likewise, to get $f_{Y_2}(y_2)$ we integrate $f_{Y_1Y_2}(y_1, y_2) = 2 f_X(y_1) f_X(y_2)$ over all $y_1 < y_2$.
The result is

$$f_{Y_2}(y_2) = 2 f_X(y_2) \int_{-\infty}^{y_2} f_X(y_1)\,dy_1 = 2 f_X(y_2) F_X(y_2), \qquad -\infty < y_2 < \infty.$$
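
The two marginal pdf's are easy to evaluate for a specific parent distribution. The following is a minimal sketch, assuming a standard Normal parent and a trapezoid-rule integrator on [−8, 8]; the helper names, grid, and limits are illustrative choices, not from the text. It checks that each marginal integrates to one and computes the two means.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pdf_min(y):
    """f_{Y1}(y) = 2 f_X(y) (1 - F_X(y))."""
    return 2.0 * normal_pdf(y) * (1.0 - normal_cdf(y))


def pdf_max(y):
    """f_{Y2}(y) = 2 f_X(y) F_X(y)."""
    return 2.0 * normal_pdf(y) * normal_cdf(y)


def integrate(func, lo, hi, steps=20000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi)) + sum(func(lo + i * h) for i in range(1, steps))
    return total * h


area_min = integrate(pdf_min, -8.0, 8.0)
area_max = integrate(pdf_max, -8.0, 8.0)
mean_min = integrate(lambda y: y * pdf_min(y), -8.0, 8.0)
mean_max = integrate(lambda y: y * pdf_max(y), -8.0, 8.0)
print("areas:", area_min, area_max)
print("means:", mean_min, mean_max)
```

For a standard Normal parent the means of the minimum and maximum are known to be −1/√π and +1/√π (about ∓0.564), which the numerical values should reproduce.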
Example 3.4-5
(the minimum and maximum of two Normal random variables) We wish to see what the pdfs
of ordered Normal RVs look like. To that end let $X_1, X_2$ be i.i.d. Normal N(0, 1) RVs
and define $Y_1 = \min(X_1, X_2)$ and $Y_2 = \max(X_1, X_2)$. Using the results of Example 3.4-4 we
graph these pdf's together with the Normal pdf on the same axes. The curves in Figure 3.4-5
were obtained using the program Microsoft Excel†. The reader may want to duplicate these
curves.

[Plot: pdfs of standard Normal, maximum of two standard Normals, and minimum of two standard Normals, versus the argument.]

Figure 3.4-5 The pdf of $\min(X_1, X_2)$ peaks at the left of the origin at −0.5 while the pdf of $\max(X_1, X_2)$
peaks at the right of the origin at 0.5. Note that $\mathrm{Var}[\min(X_1, X_2)] = \mathrm{Var}[\max(X_1, X_2)] < 1$.


3.5 ADDITIONAL EXAMPLES 


To enable the reader to become familiar with the methods discussed in Section 3.4, we 
present here a number of additional examples. 


Example 3.5-1 
(magnitude and angle) Consider the RVs

$$V \triangleq g(X, Y) = \sqrt{X^2 + Y^2} \tag{3.5-1}$$

$$W \triangleq h(X, Y) = \begin{cases} \tan^{-1}\left(\dfrac{Y}{X}\right), & X > 0, \\[2mm] \tan^{-1}\left(\dfrac{Y}{X}\right) + \pi, & X < 0. \end{cases} \tag{3.5-2a}$$

The RV V is called the magnitude or envelope while W is called the phase. Equation 3.5-2a
has been written in this form because we seek a solution for w over a 2π interval and the
inverse function $\tan^{-1}(y/x)$ has range (−π/2, π/2) (i.e., its principal value).


†Excel is available with Microsoft Office. Instructions for using Excel are available with the program.

To find the roots of

$$v - g(x, y) = 0, \qquad w - h(x, y) = 0, \tag{3.5-2b}$$

we observe that for x > 0, we have $-\frac{\pi}{2} < w < \frac{\pi}{2}$ and cos w > 0. Similarly, for x < 0,
$\frac{\pi}{2} < w < \frac{3\pi}{2}$ and cos w < 0. Hence the only solution to Equation 3.5-2b is

$$x = v\cos w \triangleq \phi(v, w)$$
$$y = v\sin w \triangleq \psi(v, w).$$

The Jacobian $\tilde{J}$ is given by

$$\tilde{J} = \frac{\partial(\phi, \psi)}{\partial(v, w)} = \begin{vmatrix} \cos w & -v\sin w \\ \sin w & v\cos w \end{vmatrix} = v.$$

Hence the solution is, from Equation 3.4-11,

$$f_{VW}(v, w) = v f_{XY}(v\cos w, v\sin w). \tag{3.5-3}$$
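Suppose that X and Y are i.i.d. and distributed as $N(0, \sigma^2)$, that is,

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}.$$

Then from Equation 3.5-3

$$f_{VW}(v, w) = \begin{cases} \dfrac{v}{\sigma^2}\, e^{-v^2/2\sigma^2} \cdot \dfrac{1}{2\pi}, & v > 0,\ -\dfrac{\pi}{2} < w < \dfrac{3\pi}{2}, \\ 0, & \text{otherwise} \end{cases} \tag{3.5-4}$$

$$= f_V(v) f_W(w).$$

Thus, V and W are independent random variables. The envelope V has a Rayleigh pdf and
the phase W is uniform over a 2π interval.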
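
A short simulation illustrates the Rayleigh envelope and uniform phase. The following is a minimal sketch, assuming σ = 2, a fixed seed, and 20,000 samples (all illustrative choices). It applies Equations 3.5-1 and 3.5-2a to simulated Gaussian pairs and compares the sample means with the standard values σ√(π/2) for a Rayleigh RV and π/2 for a phase uniform on (−π/2, 3π/2).

```python
import math
import random


def magnitude_and_phase(x, y):
    """V and W of Equations 3.5-1 and 3.5-2a; requires x != 0."""
    v = math.hypot(x, y)
    w = math.atan(y / x)          # principal value in (-pi/2, pi/2)
    if x < 0:
        w += math.pi              # shift to (pi/2, 3pi/2) for the left half-plane
    return v, w


rng = random.Random(2024)         # fixed seed so the run is repeatable
sigma = 2.0                       # assumed standard deviation (illustrative)
samples = []
while len(samples) < 20000:
    x = rng.gauss(0.0, sigma)
    y = rng.gauss(0.0, sigma)
    if x != 0.0:                  # skip the zero-probability case x = 0
        samples.append(magnitude_and_phase(x, y))

mean_v = sum(v for v, _ in samples) / len(samples)
mean_w = sum(w for _, w in samples) / len(samples)
print("mean of V:", mean_v, "Rayleigh mean:", sigma * math.sqrt(math.pi / 2.0))
print("mean of W:", mean_w, "uniform mean :", math.pi / 2.0)
```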


Example 3.5-2 
(magnitude and ratio) Consider now a modification of the previous problem. Let
$V \triangleq \sqrt{X^2 + Y^2}$ and $W \triangleq Y/X$. Then with $g(x, y) = \sqrt{x^2 + y^2}$ and $h(x, y) = y/x$, the equations

$$v - g(x, y) = 0$$
$$w - h(x, y) = 0$$

have two solutions:

$$x_1 = v(1 + w^2)^{-1/2}, \qquad y_1 = w x_1,$$
$$x_2 = -v(1 + w^2)^{-1/2}, \qquad y_2 = w x_2$$

for $-\infty < w < \infty$ and v > 0, and no real solutions for v < 0.

A direct evaluation yields $|J_1| = |J_2| = (1 + w^2)/v$. Hence

$$f_{VW}(v, w) = \frac{v}{1 + w^2}\left[f_{XY}(x_1, y_1) + f_{XY}(x_2, y_2)\right].$$

With $f_{XY}(x, y)$ given by

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2} \exp[-(x^2 + y^2)/2\sigma^2],$$

we obtain

$$f_{VW}(v, w) = \frac{v}{\sigma^2}\, e^{-v^2/2\sigma^2}\, u(v) \cdot \frac{1/\pi}{1 + w^2} = f_V(v) f_W(w).$$

Thus, the random variables V, W are independent, with V Rayleigh distributed as in 
Example 3.5-1, and W Cauchy distributed. 
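
The Cauchy law for the ratio can be checked by simulation. The following is a minimal sketch, assuming σ = 3, a fixed seed, and 20,000 samples (illustrative choices). It compares the empirical CDF of W = Y/X with the CDF 1/2 + tan⁻¹(w)/π, which is the integral of the density (1/π)/(1 + w²) found above.

```python
import math
import random


def cauchy_cdf(w):
    """CDF obtained by integrating f_W(w) = (1/pi) / (1 + w^2)."""
    return 0.5 + math.atan(w) / math.pi


def empirical_ratio_cdf(w, sigma, n_samples, rng):
    """Fraction of simulated ratios Y/X that do not exceed w."""
    hits = 0
    count = 0
    while count < n_samples:
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        if x == 0.0:              # zero-probability case, skipped to avoid division by zero
            continue
        count += 1
        if y / x <= w:
            hits += 1
    return hits / n_samples


rng = random.Random(99)           # fixed seed so the run is repeatable
for w in (-1.0, 0.0, 2.0):
    print(w, empirical_ratio_cdf(w, 3.0, 20000, rng), cauchy_cdf(w))
```

Note that σ drops out of the ratio, in agreement with the fact that $f_W(w)$ above does not depend on σ.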


Example 3.5-3 
(rotation of coordinates) Let θ be a prescribed angle and consider the rotational
transformation

$$V \triangleq X\cos\theta + Y\sin\theta$$
$$W \triangleq X\sin\theta - Y\cos\theta \tag{3.5-5}$$

with X and Y i.i.d. Gaussian,

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}.$$

The only solution to

$$v = x\cos\theta + y\sin\theta$$
$$w = x\sin\theta - y\cos\theta$$

is

$$x = v\cos\theta + w\sin\theta$$
$$y = v\sin\theta - w\cos\theta.$$

The Jacobian $\tilde{J}$ is

$$\begin{vmatrix} \dfrac{\partial x}{\partial v} & \dfrac{\partial x}{\partial w} \\[2mm] \dfrac{\partial y}{\partial v} & \dfrac{\partial y}{\partial w} \end{vmatrix} = \begin{vmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{vmatrix} = -1.$$

Hence

$$f_{VW}(v, w) = \frac{1}{2\pi\sigma^2}\, e^{-[(v^2 + w^2)/2\sigma^2]}.$$

Thus, under the rotational transformation V = g(X, Y), W = h(X, Y) given in Equation 3.5-5,
V and W are i.i.d. Gaussian RVs just like X and Y. If X and Y are Gaussian
but not independent RVs, it is still possible to find a transformation† so that V, W will be
independent Gaussians if the joint pdf of X, Y is Gaussian (Normal).


Example 3.5-4 
Consider again the problem of solving for the pdf of $Z = \sqrt{X^2 + Y^2}$ as in Example 3.3-11.
This time we shall use Equation 3.4-11 to somewhat indirectly compute $f_Z(z)$. First we
note that $Z = \sqrt{X^2 + Y^2}$ is one function of two RVs while Equation 3.4-11 applies to two
functions of two RVs. To convert from one kind of problem to the other, we introduce an
auxiliary variable $W \triangleq X$. Then

$$Z \triangleq g(X, Y) = \sqrt{X^2 + Y^2}$$
$$W \triangleq h(X, Y) = X.$$

The equations

$$z - g(x, y) = 0$$
$$w - h(x, y) = 0$$

have two real roots for |w| < z, namely

$$x_1 = w, \qquad x_2 = w,$$
$$y_1 = \sqrt{z^2 - w^2}, \qquad y_2 = -\sqrt{z^2 - w^2}.$$

At both roots, $|\tilde{J}|$ has the same value:

$$|\tilde{J}_1| = |\tilde{J}_2| = \frac{z}{\sqrt{z^2 - w^2}}.$$

Hence a direct application of Equation 3.4-11 yields

$$f_{ZW}(z, w) = \frac{z}{\sqrt{z^2 - w^2}}\left[f_{XY}(x_1, y_1) + f_{XY}(x_2, y_2)\right].$$

Now assume that

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}.$$

Then, since in this case $f_{XY}(x, y) = f_{XY}(x, -y)$, we obtain

$$f_{ZW}(z, w) = \begin{cases} \dfrac{1}{\pi\sigma^2}\, \dfrac{z}{\sqrt{z^2 - w^2}}\, e^{-z^2/2\sigma^2}, & z > 0,\ |w| < z, \\ 0, & \text{otherwise.} \end{cases}$$

†See Chapter 5 on random vectors.

Figure 3.5-1 Trigonometric transformation w = z sin θ.


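However, we don't really want $f_{ZW}(z, w)$, but only the marginal pdf $f_Z(z)$. To obtain this,
we use Equation 2.6-47 (with x replaced by z and y replaced by w). This gives

$$\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} f_{ZW}(z, w)\,dw \\
&= \frac{z}{\sigma^2}\, e^{-z^2/2\sigma^2} \left(\frac{2}{\pi} \int_0^z \frac{dw}{\sqrt{z^2 - w^2}}\right) u(z).
\end{aligned}$$

The term in parentheses has value unity. To see this consider the triangle in
Figure 3.5-1 and let $w \triangleq z\sin\theta$. Then $dw = z\cos\theta\,d\theta$ and $[z^2 - w^2]^{1/2} = z\cos\theta$ and
the term in parentheses becomes

$$\frac{2}{\pi} \int_0^z \frac{dw}{\sqrt{z^2 - w^2}} = \frac{2}{\pi} \int_0^{\pi/2} d\theta = 1.$$

Hence

$$f_Z(z) = \frac{z}{\sigma^2}\, e^{-z^2/2\sigma^2}\, u(z),$$

which is the same result as obtained in Equation 3.3-35, obtained there by a different
method.


Example 3.5-5
(sum and difference again) Finally, let us return to the problem considered in Example 3.4-1:

$$v \triangleq x + y$$
$$w \triangleq x - y.$$

The only root to

$$v - (x + y) = 0$$
$$w - (x - y) = 0$$

is

$$x = \frac{v + w}{2}, \qquad y = \frac{v - w}{2}$$

and $|\tilde{J}| = \frac{1}{2}$. Hence

$$f_{VW}(v, w) = \frac{1}{2} f_{XY}\!\left(\frac{v + w}{2}, \frac{v - w}{2}\right).$$

We verify in passing that

$$\begin{aligned}
f_V(v) &= \int_{-\infty}^{\infty} f_{VW}(v, w)\,dw \\
&= \int_{-\infty}^{\infty} \frac{1}{2} f_{XY}\!\left(\frac{v + w}{2}, \frac{v - w}{2}\right) dw \qquad \text{with } z \triangleq \frac{v + w}{2} \\
&= \int_{-\infty}^{\infty} f_{XY}(z, v - z)\,dz.
\end{aligned}$$

This important result on the sum of two RVs was derived in Section 3.3 (Equation 3.3-12)
by different means.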
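
The marginal formula for the sum is straightforward to evaluate numerically. The following is a minimal sketch, assuming X and Y are independent and uniform on [0, 1] (an illustrative choice, not from the text) and using a midpoint rule for the integral over z. For this choice the pdf of V = X + Y is the triangular density on [0, 2], which serves as the reference.

```python
def uniform_pdf(x):
    """pdf of U[0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def f_xy(x, y):
    """Joint pdf of two independent U[0, 1] RVs (assumed for illustration)."""
    return uniform_pdf(x) * uniform_pdf(y)


def pdf_of_sum(v, lo=-1.0, hi=3.0, steps=4000):
    """f_V(v) = integral of f_XY(z, v - z) dz, midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        total += f_xy(z, v - z)
    return total * h


def triangular_pdf(v):
    """Known pdf of the sum of two independent U[0, 1] RVs."""
    if 0.0 <= v <= 1.0:
        return v
    if 1.0 < v <= 2.0:
        return 2.0 - v
    return 0.0


for v in (0.5, 1.0, 1.5, 2.5):
    print(v, pdf_of_sum(v), triangular_pdf(v))
```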


SUMMARY 


The material in this chapter discussed functions of random variables, a subject of great 
significance in applied science and engineering and fundamental to the study of random 
processes. The basic problem dealt with computing the probability law of an output random 
variable Y produced by a system transformation g operating on an input random variable 
X (i.e., Y = g(X)). The problem was then extended to two input random variables X, Y
being operated upon by system transformations g and h to produce two output random 
variables V = g(X,Y) and W = h(X,Y). Then the problem is to compute the joint pdf 
(PMF) of V, W from the joint pdf (PMF) of X, Y. 

We showed how most problems involving functions of RVs could be computed in at 
least two ways: 


1. the so-called indirect approach through the CDF; and 
2. directly through the use of a “turn-the-crank” direct method. 


A number of important problems involving transformations of random variables were worked 
out including computing the pdf (and PMF) of the sum of two random variables, a problem 
which has numerous applications in science and engineering where unwanted additive noise 
contaminates a desired signal or measurement. For example, the so-called “signal and additive
noise problem” is a seminal issue in communications engineering.

Later, when we extend the analysis of the sum of two independent random variables to 
the sum of n independent random variables, we will begin to observe that the CDF of the 
sum starts to “look like” the CDF of a Normal random variable. This fundamental result, 
that is, convergence to the CDF of the Normal, is called the Central Limit Theorem, and is 
discussed in Chapter 4. 

Finally we considered how to compute the pdf of two ordered random variables. We 
found we could do this using the powerful so-called direct method for computing distributions
of RVs functionally related to other RVs. Later, in Chapter 5 on random vectors, we will
discuss transformations involving n ordered RVs. Ordered random variables appear in a
branch of statistics called nonparametric statistics and often yield results that are independent
of underlying distributions. In this sense, ordered random variables yield a certain
level of robustness to expressions derived about them.


PROBLEMS 

(*Starred problems are more advanced and may require more work and/or additional 
reading.) 

3.1 Let X have CDF $F_X(x)$ and consider Y = aX + b, where a < 0. Show that if X is
not a continuous RV, then Equation 3.2-3 should be modified to

$$F_Y(y) = 1 - F_X\!\left(\frac{y - b}{a}\right) + P\!\left[X = \frac{y - b}{a}\right] = 1 - F_X\!\left(\frac{y - b}{a}\right) + P_X\!\left(\frac{y - b}{a}\right).$$

3.2 Let Y be a function of the RV X as follows:

$$Y \triangleq \begin{cases} X, & X \ge 0, \\ X^2, & X < 0. \end{cases}$$

Compute $f_Y(y)$ in terms of $f_X(x)$. Assume that X: N(0, 1).

3.3 (function of RV) Let the random variable X be Gaussian distributed as N(0, 25).
Define the random variable Y = g(X) with the function g given as

$$g(x) \triangleq \begin{cases} 2x, & x \ge 0, \\ -x, & x < 0. \end{cases}$$

Find $f_Y(y)$, the pdf of Y.

3.4 Let Y be a function of the random variable X as follows:

$$Y \triangleq \begin{cases} X, & X \ge 0, \\ 2X^2, & X < 0. \end{cases}$$

Compute pdf $f_Y(y)$ in terms of pdf $f_X(x)$. Let $f_X(x)$ be given by

$$f_X(x) = \frac{1}{\sqrt{4\pi}}\, e^{-x^2/4},$$

that is, X: N(0, 2).
3.5 Let X have pdf

$$f_X(x) = \alpha e^{-\alpha x} u(x).$$

Compute the pdf of (a) $Y = X^3$; (b) Y = 2X + 3.

3.6 Let X be a Laplacian random variable with pdf

$$f_X(x) = \frac{1}{2}\, e^{-|x|}, \qquad -\infty < x < +\infty.$$

Let Y = g(X), where g(·) is the nonlinear function given as the saturable limiter

$$g(x) \triangleq \begin{cases} -1, & x < -1, \\ x, & -1 \le x \le +1, \\ +1, & x > +1. \end{cases}$$

Find the distribution function $F_Y(y)$.

*3.7 In medical imaging such as computer tomography, the relation between detector
readings y and body absorptivity x follows a $y = e^x$ law. Let $X: N(\mu, \sigma^2)$; compute
the pdf of Y. This distribution of Y is called lognormal. The lognormal random
variable has been found quite useful for modeling failure rates of semiconductors,
among many other uses.

*3.8 In the previous problem you found that if $X: N(\mu, \sigma^2)$, then Y = exp X has a
lognormal density or pdf

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma y} \exp\left[-\frac{(\ln y - \mu)^2}{2\sigma^2}\right] u(y).$$

(a) Sketch the lognormal density for a couple of values of μ and σ.

(b) What is the distribution function of the lognormal random variable Y? Express
your answer in terms of our erf function. Hint: There are two possible approaches.
You can use the method of substitution to integrate the above density, or you
can find the distribution function of Y directly as a transformation of random
variable problem.

*3.9 In homomorphic image processing, images are enhanced by applying nonlinear
transformations to the image functions. Assume that the image function is modeled as
RV X and the enhanced image Y is Y = ln X. Note that X cannot assume negative
values. Compute the pdf of Y if X has an exponential density $f_X(x) = \frac{1}{3} e^{-\frac{1}{3}x} u(x)$.

3.10 Assume that X: N(0, 1) and let Y be defined by

$$Y \triangleq \begin{cases} \sqrt{X}, & X \ge 0, \\ 0, & X < 0. \end{cases}$$

Compute the pdf of Y.

3.11 (a) Let X: N(0, 1) and let $Y \triangleq g(X)$, where the function g is shown in Figure P3.11.
Use the indirect approach to compute $F_Y(y)$ and $f_Y(y)$ from $f_X(x)$. (b) Compute
$f_Y(y)$ from Equation 3.2-23. Why can't Equation 3.2-23 be used to compute $f_Y(y)$
at y = 0, 1?


Figure P3.11

3.12 Let X: U[0, 2]. Compute the pdf of Y if Y = g(X), where the function g is plotted
in Figure P3.12.

Figure P3.12


3.13 Let X: U[0, 2]. Compute the pdf of Y if Y = g(X) with the function g as shown in
Figure P3.13.

Figure P3.13

3.14 Let the RV X: N(0, 1), that is, X is Gaussian with pdf

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad -\infty < x < +\infty.$$

Let Y = g(X), where g is the nonlinear function given as

$$g(x) \triangleq \begin{cases} -1, & x < -1, \\ x, & -1 \le x \le 1, \\ 1, & x > 1. \end{cases}$$

It is called a saturable limiter function. (a) Sketch g(x); (b) find $F_Y(y)$; (c) find and
sketch $f_Y(y)$.

3.15 Compute the pdf of Y = a/X (a > 0). Show that if X is Cauchy with parameter α,
Y is Cauchy with parameter a/α.

3.16 Let $Y \triangleq \sec X$. Compute $f_Y(y)$ in terms of $f_X(x)$. What is $f_Y(y)$ when $f_X(x)$ is
uniform in (−π, π]?

3.17 Given two uniformly distributed random variables X: U(−1, +1) and Y: U(−2, +2),
find the density function for Z = X + Y under the condition that X and Y are
independent.

3.18 Let X and Y be independent and identically distributed exponential RVs with

$$f_X(x) = f_Y(x) = \alpha e^{-\alpha x} u(x).$$

Compute the pdf of $Z \triangleq X - Y$.

3.19 Let random variables X and Y be described by the given joint pdf $f_{X,Y}(x, y)$. Define
new random variables as

$$V \triangleq X + Y \quad \text{and} \quad W \triangleq 2X - Y.$$

(a) Find the joint pdf $f_{V,W}(v, w)$ in terms of the joint pdf $f_{X,Y}(x, y)$.
(b) Show, using the results of part (a) or in any other valid way, that under
suitable conditions

$$f_Z(z) = \int_{-\infty}^{+\infty} f_X(x) f_Y(z - x)\,dx,$$

for Z = X + Y. What are the suitable conditions?


3.20 Repeat Example 3.2-11 for $f_X(x) = e^{-x} u(x)$.

3.21 Repeat Example 3.2-12 for $f_X(x) = e^{-x} u(x)$.

3.22 The objective is to generate numbers from the pdf shown in Figure P3.22. All
that is available is a random number generator that generates numbers uniformly
distributed in (0, 1). Explain what procedure you would use to meet the objective.

Figure P3.22

3.23 It is desired to generate zero-mean Gaussian numbers. All that is available is a
random number generator that generates numbers uniformly distributed on (0, 1).
It has been suggested Gaussian numbers might be generated by adding 12 uniformly
distributed numbers and subtracting 6 from the sum. Write a program in which you
use the procedure to generate 10,000 numbers and plot a histogram of your result.
A histogram is a bar graph that has bins along the x-axis and number of points in
the bin along the y-axis. Choose 200 bins of width 0.1 to span the range from −10
to 10. In what region of the histogram does the data look most Gaussian? Where
does it look least Gaussian? Give an explanation of why this approach works.

*3.24 Random number generators on computers often provide a basic uniform random
variable X: U[0, 1]. This problem explores how to get more general distributions by
transformation of such an X.

(a) Consider the Laplacian density $f_Y(y) = \frac{c}{2} \exp(-c|y|)$, $-\infty < y < +\infty$, with
parameter c > 0, that often arises in image processing problems. Find the
corresponding Laplacian distribution function $F_Y(y)$ for $-\infty < y < +\infty$.

(b) Consider the transformation

$$z = g(x) = F_Y^{-1}(x),$$

using the distribution function you found in part (a). Note that $F^{-1}$ denotes
an inverse function. Show that the resulting random variable Z = g(X) will
have the Laplacian distribution with parameter c if X: U[0, 1]. Note also that
this general result does not depend on the Laplacian distribution function
other than that it has an inverse.

(c) What are the limitations of this transform approach? Specifically, will it work
with mixed random variables? Will it work with distribution functions that
have flat regions? Will it work with discrete random variables?

3.25 In Problem 3.18 compute the pdf of |Z|.

3.26 Let X and Y be independent, continuous RVs. Let Z = min(X,Y). Compute $F_Z(z)$
and $f_Z(z)$. Sketch the result if X and Y are distributed as U(0,1). Repeat for the
exponential density $f_X(x) = f_Y(x) = \alpha \exp[-\alpha x]\, u(x)$.

3.27 Let $Z \triangleq \max(X_1, X_2)$, where $X_1$ and $X_2$ are independent and exponentially distributed
random variables with pdf

(a) Find the distribution function of Z.
(b) Find the pdf of Z.

3.28 Consider n i.i.d. RVs $X_1, X_2, \ldots, X_n$ with common CDF $F_{X_i}(x) \triangleq F(x)$. Let
$Z \triangleq \max[X_1, X_2, \ldots, X_n]$. Compute the CDF of Z in terms of the CDF F(x).

3.29 Consider n i.i.d. RVs $X_1, X_2, \ldots, X_n$ with common CDF $F_{X_i}(x) \triangleq F(x)$. Let
$Z \triangleq \min[X_1, X_2, \ldots, X_n]$. Compute the CDF of Z in terms of the CDF F(x).

3.30 Let $X_1, X_2, \ldots, X_n$ be n i.i.d. exponential random variables with $f_{X_i}(x) = e^{-x}u(x)$.
Compute an explicit expression for the pdf of $Z_n = \max(X_1, X_2, \ldots, X_n)$. Sketch
the pdf for n = 3.
Let $X_1, X_2, \ldots, X_n$ be n i.i.d. exponential random variables with $f_{X_i}(x) = e^{-x}u(x)$.
Compute an explicit expression for the pdf of $Z_n = \min(X_1, X_2, \ldots, X_n)$. Sketch
the pdf for n = 3.

3.31 Let X, Y be i.i.d. as U(-1,1). Compute and sketch the pdf of Z for the system
shown in Figure P3.31. The square-root operation is valid only for positive numbers.
Otherwise the output of the square-root device is zero.

Figure P3.31 A square-root device.

3.32 The length of time, Z, an airplane can fly is given by Z = aX, where X is the
amount of fuel in its tank and a > 0 is a constant of proportionality. Suppose a plane
has two independent fuel tanks so that when one gets empty the other switches on
automatically. Because of lax maintenance a plane takes off with neither of its fuel
tanks checked. Let $X_1$ be the fuel in the first tank and $X_2$ the fuel in the second
tank. Let $X_1$ and $X_2$ be modeled as uniform i.i.d. RVs with pdf $f_{X_1}(x) = f_{X_2}(x) =
\frac{1}{b}[u(x) - u(x - b)]$. Compute the pdf of Z, the maximum flying time of the plane. If
b = 100, say in liters, and a = 1 hour/10 liters, what is the probability that the
plane will fly at least five hours?

3.33 Let X and Y be two independent Poisson RVs with PMFs


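$$P_X(k) = \frac{1}{k!} e^{-2}\, 2^k\, u(k) \quad \text{and} \tag{3.5-6}$$

$$P_Y(k) = \frac{1}{k!} e^{-3}\, 3^k\, u(k), \quad \text{respectively.} \tag{3.5-7}$$

Compute $P[Z \le 5]$, where $Z \triangleq X + Y$. [Hint: $\sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j} = (a+b)^n$.]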
3.34 Given two random variables X and Y that are independent and uniformly distributed
as U(0,1):

(a) Find the joint pdf $f_{UV}$ of random variables U and V defined as:

$$U \triangleq \tfrac{1}{2}(X + Y) \quad \text{and} \quad V \triangleq \tfrac{1}{2}(X - Y).$$

(b) Sketch the support of $f_{UV}$ in the (u,v) plane. Remember that the support of a function
is the subset of its domain for which the function takes on nonzero values.

3.35 Let X and Y be independent, uniformly distributed RVs with pdfs $f_X(x) = \frac{1}{2}$,
$|x| < 1$ and zero otherwise, and $f_Y(y) = \frac{1}{4}$, $|y| < 2$ and zero otherwise. Compute
(a) the pdf of $Z \triangleq X + Y$; (b) the pdf of $Z \triangleq 2X - Y$.

3.36 Compute the joint pdf $f_{ZW}(z, w)$ if

$$Z \triangleq X^2 + Y^2, \qquad W \triangleq X$$

when

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}, \quad -\infty < x < \infty,\ -\infty < y < \infty.$$

Then compute $f_Z(z)$ from your results.

*3.37 Consider the transformation

$$Z = aX + bY, \qquad W = cX + dY.$$

Let

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}}\, e^{-Q(x,y)},$$

where

$$Q(x, y) = \frac{1}{2\sigma^2(1 - \rho^2)}\,[x^2 - 2\rho x y + y^2].$$

What combination of coefficients a, b, c, d will enable Z, W to be independent
Gaussian RVs?

3.38 Let

$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\left(\frac{x^2 - 2\rho x y + y^2}{2(1 - \rho^2)}\right)\right].$$

Compute the joint pdf $f_{VW}(v, w)$ of
3.39 Derive Equation 3.4-4 by the direct method, that is, using Equations 3.4-11 or 3.4-12.
3.40 Consider the transformation

$$Z = X\cos\theta + Y\sin\theta \tag{3.5-10}$$
$$W = X\sin\theta - Y\cos\theta. \tag{3.5-11}$$

Compute the joint pdf $f_{ZW}(z, w)$ in terms of $f_{XY}(x, y)$ if

$$f_{XY}(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}, \quad -\infty < x < \infty,\ -\infty < y < \infty.$$

(It may be helpful to note that this transformation is a rotation by $+\theta$ followed by
a negation on W.)
3.41 Compute the joint pdf of 


$Z \triangleq X^2 + Y^2$


w Say 


when 


$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}.$$

3.42 Let $f_{XY}(x, y) = A(x^2 + y^2)$ for $0 < x < 1$, $|y| < 1$, and zero otherwise. Compute the
CDF $F_{XY}(x, y)$ for all x, y. Determine the value of A.

3.43 Consider the input-output view mentioned in Section 3.1. Let the underlying
experiment be observations on an RV X, which is the input to a system that generates
an output Y = g(X).
(a) What is the range of Y?
(b) What are reasonable probability spaces for X and Y?
(c) What subset of $R^1$ consists of the event $\{Y \le y\}$?
(d) What is the inverse image under Y of the event $(-\infty, y]$ if Y = 2X + 3?


3.44 In the diagram shown in Figure P3.44, it is attempted to deliver the signal X from
points a to b. The two links L1 and L2 operate independently, with times-to-failure
$T_1$, $T_2$, respectively, which are exponentially and identically distributed with rate
$\lambda$ (>0). Set Y = 0 if both links fail. Denote the output by Y and compute $F_Y(y, t)$,
the CDF of Y at time t. Show for any fixed t that $F_Y(\infty, t) = 1$.
Figure P3.44 Parallel links L1 and L2.
4 Expectation and Moments


4.1 EXPECTED VALUE OF A RANDOM VARIABLE 


It is often desirable to summarize certain properties of an RV and its probability law by a 
few numbers. Such numbers are furnished to us by the various averages, or expectations of 
an RV; the term moments is often used to describe a broad class of averages, and we shall 
use it later. 

We are all familiar with the notion of the average of a set of numbers, for example, the 
average class grade for an exam, the average height and weight of children at age five, the 
average lifetime of men versus women, and the like. Basically, we compute the average of a 
set of numbers $x_1, x_2, \ldots, x_N$ as follows:

$$\mu_s = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{4.1-1}$$

where the subscript s is a reminder that $\mu_s$ is the average of a set.

The average $\mu_s$ of a set of numbers $x_1, x_2, \ldots, x_N$ can be viewed as the “center of
gravity” of the set. More precisely, the average is the number that is simultaneously closest
to all the numbers in the set in the sense that the sum of the squared distances from it to all
the points in the set is smallest. To demonstrate this we need only ask what number z minimizes
the summed distance-square $D^2$ to all the points. Thus with

$$D^2 \triangleq \sum_{i=1}^{N} (z - x_i)^2,$$

the minimum occurs when $dD^2/dz = 0$ or

$$\frac{dD^2}{dz} = 2Nz - 2\sum_{i=1}^{N} x_i = 0,$$

which implies that

$$z = \mu_s = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

Note that each number in Equation 4.1-1 is given the same weight (i.e., each $x_i$ is multiplied
by the same factor 1/N). If, for some reason we wish to give some numbers more weight 
than others when computing the average, we then obtain a weighted average. However, we 
won’t pursue the idea of a weighted average any further in this chapter. 
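The least-squares property of the set average can be checked numerically. The following is a minimal sketch in Python (a language chosen here for illustration only); the data values and the helper names `summed_square_distance` and `set_average` are assumptions made for this example, not taken from the text. It scans a grid of candidate values z and confirms that the summed squared distance is smallest at the average.

```python
def summed_square_distance(z, xs):
    """Return D^2(z), the sum over i of (z - x_i)^2."""
    return sum((z - x) ** 2 for x in xs)


def set_average(xs):
    """Return the equally weighted average of Equation 4.1-1."""
    return sum(xs) / len(xs)


xs = [2.0, 3.0, 7.0, 8.0]          # illustrative data, not from the text
mu_s = set_average(xs)

# Candidate values of z on a grid centered on the average.
candidates = [mu_s + 0.01 * k for k in range(-200, 201)]
best = min(candidates, key=lambda z: summed_square_distance(z, xs))

if __name__ == "__main__":
    print("average:", mu_s, "grid minimizer of D^2:", best)
```

A grid scan is used instead of calculus so that the check does not rely on the derivation it is meant to confirm.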

Although the average as given in Equation 4.1-1 gives us the “most likely” value or the 
“center of gravity” of the set, it does not tell us how much the numbers spread or deviate 
from the average. For example, the sets of numbers $S_1 = \{0.9, 0.98, 0.95, 1.1, 1.02, 1.05\}$
and $S_2 = \{0.2, -3, 1.8, 2, 4, 1\}$ have the same average but the spread of the numbers in $S_2$
is much greater than that of $S_1$. An average that summarizes this spread is the standard
deviation of the set, $\sigma_s$, computed from

$$\sigma_s = \left[\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_s)^2\right]^{1/2}. \tag{4.1-2}$$


Equations 4.1-1 and 4.1-2, important as they are, fall far short of disclosing the usefulness 
of averages. To exploit the full range of applications of averages, we must develop a calculus 
of averages from probability theory. 
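As a quick check of Equations 4.1-1 and 4.1-2 on the two sets quoted above, the sketch below (Python, with function names chosen here for illustration) computes the average and standard deviation of each set. Both averages come out equal to 1, while the standard deviations differ by more than an order of magnitude.

```python
import math


def set_average(xs):
    """Equation 4.1-1: equally weighted average of a set."""
    return sum(xs) / len(xs)


def set_std(xs):
    """Equation 4.1-2: root of the mean squared deviation from the average."""
    mu = set_average(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


S1 = [0.9, 0.98, 0.95, 1.1, 1.02, 1.05]
S2 = [0.2, -3, 1.8, 2, 4, 1]

if __name__ == "__main__":
    for name, s in (("S1", S1), ("S2", S2)):
        print(name, "average:", set_average(s), "std:", set_std(s))
```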

Consider a probability space $(\Omega, \mathcal{F}, P)$ associated with an experiment $\mathcal{H}$ and a discrete
RV X. Associated with each outcome $\zeta_i$ of $\mathcal{H}$, there is a value $X(\zeta_i) \triangleq x_i$, which the RV
X takes on. Let $x_1, x_2, \ldots, x_M$ be the M distinct values that X can take. Now assume that
$\mathcal{H}$ is repeated N times and let $x^{(k)}$ be the observed outcome at the kth trial. Note that
$x^{(k)}$ must assume one of the numbers $x_1, \ldots, x_M$. Suppose that in the N trials $x_1$ occurs
$n_1$ times, $x_2$ occurs $n_2$ times, and so forth. Then for N large, we can estimate the average
value $\mu_X$ of X from the formula

$$\mu_X \approx \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \tag{4.1-3}$$

$$= \sum_{i=1}^{M} x_i \left(\frac{n_i}{N}\right) \tag{4.1-4}$$

$$\approx \sum_{i=1}^{M} x_i P[X = x_i]. \tag{4.1-5}$$


Example 4.1-1 
(loaded dice) We observe 17 tosses of a loaded die. Here N = 17, M = 6 (the six faces of the
die). The observations are {1,3,3,1,2,1,3,2,1,1,2,4,1,1,5,3,6}. Let P[i] denote the probability
of observing the face with the number i on it. Then from the observational data we get

P[1] ≈ 7/17; P[2] ≈ 3/17; P[3] ≈ 4/17; P[4] ≈ 1/17; P[5] ≈ 1/17; P[6] ≈ 1/17.

These estimates of the “true” probabilities are quite unreliable, however. To get more reliable
data we would have to greatly increase the number of tosses. We might ask what the
“true” probabilities are anyway. One answer might be that the Laws of Nature have imbued
the die with an inherent set of probabilities that must be determined by experimentation.
Another view is that the true probabilities are the ratios $P[i] = n_i/N$ you get when N
becomes arbitrarily large. However, what is meant by arbitrarily large? For any finite value
of N the estimated probabilities will always change as we increase N. These conundrums
are mostly resolved by statistics, discussed in some detail in Chapters 6 and 7.
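The relative-frequency estimates in this example, and the equivalence of Equations 4.1-3 and 4.1-4, can be reproduced with a short sketch. The code below is an illustration in Python using exact fractions; the variable names are chosen here and are not from the text. It tallies the 17 observations and computes the estimated average both as a plain average of the observations and as a frequency-weighted sum over the faces.

```python
from collections import Counter
from fractions import Fraction

# The 17 observed tosses of the loaded die from the example.
tosses = [1, 3, 3, 1, 2, 1, 3, 2, 1, 1, 2, 4, 1, 1, 5, 3, 6]
N = len(tosses)
counts = Counter(tosses)

# Relative frequencies n_i / N used as probability estimates.
rel_freq = {face: Fraction(counts[face], N) for face in range(1, 7)}

# Equation 4.1-3: average of the raw observations.
mean_direct = Fraction(sum(tosses), N)
# Equation 4.1-4: values weighted by their relative frequencies.
mean_from_freq = sum(face * p for face, p in rel_freq.items())

if __name__ == "__main__":
    print(rel_freq)
    print("estimated average:", mean_direct, "=", float(mean_direct))
```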


Equation 4.1-5, which follows from the frequency definition of probability, leads us to our 
first definition. 


Definition 4.1-1 The expected or average value of a discrete RV X taking on values
$x_i$ with PMF $P_X(x_i)$ is defined by

$$E[X] \triangleq \sum_{i} x_i P_X(x_i). \tag{4.1-6}$$

As given, the expectation is computed in the probability space generated by the RV. We can
also compute the expectation by summing over all points of the discrete sample space, that
is, $E[X] = \sum_{i} X(\zeta_i) P[\{\zeta_i\}]$, where the $\zeta_i$ are the discrete outcome points in the sample
space $\Omega$.

A definition that applies to both continuous and discrete RVs is the following: 


Definition 4.1-2 The expected value or mean, if it exists,† of a real RV X with pdf
$f_X(x)$ is defined by

$$E[X] \triangleq \int_{-\infty}^{\infty} x f_X(x)\, dx. \tag{4.1-7}$$

Here, as well as in Definition 4.1-1, the expectation can be computed in the original
probability space. If the sample description space is not discrete but continuous, for example, an
uncountably infinite set of outcomes such as the real line, then $E[X] = \int_{\Omega} X(\zeta) P[\{d\zeta\}]$,
where $P[\{d\zeta\}]$ is the probability of the infinitesimal event $\{\zeta < \zeta' \le \zeta + d\zeta\}$.

†The expected value will exist if the integral is absolutely convergent, that is, if $\int_{-\infty}^{\infty} |x| f_X(x)\, dx < \infty$.

The symbols $E[X]$, $\overline{X}$, $\mu_X$, or simply $\mu$ are often used interchangeably for the expected
value of X. Consider now a function of an RV, say, Y = g(X). The expected value of Y is,
from Equation 4.1-7,

$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\, dy. \tag{4.1-8}$$

However, Equation 4.1-8 requires computing $f_Y(y)$ from $f_X(x)$. If all we want is E[Y],
is there a way to compute it without first computing fy(y)? The answer is given by 
Theorem 4.1-1 which follows. 


Theorem 4.1-1 The expected value of Y = g(X) can be computed from

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx, \tag{4.1-9}$$

where g is a measurable (Borel) function.† Equation 4.1-9 is an important result in the theory
of probability. A rigorous proof of Equation 4.1-9 requires some knowledge of Lebesgue
integration;‡ we instead offer an informal argument below for the validity of
Equation 4.1-9.
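Theorem 4.1-1 can be illustrated numerically on a case where both routes are tractable. The sketch below is a Python illustration under assumptions chosen here, not taken from the text: X is uniform on (0,1) and $g(x) = x^2$, so that $f_Y(y) = 1/(2\sqrt{y})$ on (0,1). It evaluates E[Y] once through $f_Y$ (Equation 4.1-8) and once through $f_X$ (Equation 4.1-9) using a midpoint Riemann sum; both values should be close to 1/3.

```python
import math


def midpoint_integral(func, a, b, n=200_000):
    """Approximate the integral of func over (a, b) with a midpoint Riemann sum."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


def f_X(x):
    """Assumed pdf of X: uniform on (0, 1)."""
    return 1.0


def g(x):
    """Assumed transformation Y = g(X) = X^2."""
    return x * x


def f_Y(y):
    """pdf of Y = X^2 for X uniform on (0, 1), valid for 0 < y < 1."""
    return 1.0 / (2.0 * math.sqrt(y))


# Equation 4.1-8: integrate y * f_Y(y).
mean_via_f_Y = midpoint_integral(lambda y: y * f_Y(y), 0.0, 1.0)
# Equation 4.1-9: integrate g(x) * f_X(x), with no need for f_Y.
mean_via_theorem = midpoint_integral(lambda x: g(x) * f_X(x), 0.0, 1.0)

if __name__ == "__main__":
    print(mean_via_f_Y, mean_via_theorem)
```

The midpoint rule is used because it never evaluates $f_Y$ at y = 0, where that density is unbounded.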


On the Validity of Equation 4.1-9


Recall from Section 3.2 that if Y = g(X) then for any $y_j$ (Figure 4.1-1)

$$\{y_j < Y \le y_j + \Delta y_j\} = \bigcup_{k=1}^{r_j} \{x_j^{(k)} < X \le x_j^{(k)} + \Delta x_j^{(k)}\}, \tag{4.1-10}$$

where $r_j$ is the number of real roots of the equation $y_j - g(x) = 0$, that is,

$$y_j = g(x_j^{(1)}) = \ldots = g(x_j^{(r_j)}). \tag{4.1-11}$$

†See definition of a measurable function in Section 3.1.
‡See Feller [4-1, p.5] or Davenport [4-2, p.223].

Figure 4.1-1 Equivalence between the events given in Equation 4.1-10.

The equal sign in Equation 4.1-10 means that the underlying event is the same for both
mappings X and Y. Hence the probabilities of the events on either side of the equal sign
are equal. The events on the right side of Equation 4.1-10 are disjoint and therefore the
probability of the union is the sum of the probabilities of the individual events. Now partition
the y-axis into many fine subintervals $y_1, y_2, \ldots, y_j, \ldots$. Then, approximating Equation 4.1-8
with a Riemann† sum and using Equation 2.4-6, we can write‡

$$
\begin{aligned}
E[Y] &= \int_{-\infty}^{\infty} y f_Y(y)\, dy \\
&\approx \sum_{j=1}^{m} y_j P[y_j < Y \le y_j + \Delta y_j] \\
&= \sum_{j=1}^{m} \sum_{k=1}^{r_j} g(x_j^{(k)})\, P[x_j^{(k)} < X \le x_j^{(k)} + \Delta x_j^{(k)}].
\end{aligned} \tag{4.1-12}
$$

The last line of Equation 4.1-12 is obtained with the help of Equations 4.1-10 and 4.1-11.
But the points $x_j^{(k)}$ are distinct, so that the cumbersome double indices j and k can be
replaced with a single subscript index, say, i. Then Equation 4.1-12 becomes

$$E[Y] \approx \sum_{i} g(x_i)\, P[x_i < X \le x_i + \Delta x_i],$$

and as $\Delta y, \Delta x \to 0$ we obtain the exact result that

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx. \tag{4.1-13}$$

Equation 4.1-13 follows from the Riemann sum approximation and Equation 2.4-6; the $x_i$
have been ordered in increasing order $x_1 < x_2 < x_3 < \ldots$.

†Bernhard Riemann (1826-1866). German mathematician who made numerous contributions to the
theory of integration.
‡The argument follows that of Papoulis [4-3, p.141].
In the special case where X is a discrete RV,

$$E[Y] = \sum_{i} g(x_i) P_X(x_i). \tag{4.1-14}$$


This result follows immediately from Equation 4.1-13, since the pdf of a discrete RV involves 
delta functions that have the property given in Equation B.3-1 in Appendix B. 


Example 4.1-2 


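(expected value of Gaussian) Let $X: N(\mu, \sigma^2)$, read “X is distributed as Normal with
parameters $\mu$ and $\sigma^2$.” The expected value or mean of X is

$$E[X] = \int_{-\infty}^{\infty} x \left(\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]\right) dx.$$

Let $z \triangleq (x - \mu)/\sigma$. Then

$$E[X] = \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-\frac{1}{2}z^2}\, dz + \mu \left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}z^2}\, dz\right).$$

The first term is zero because the integrand is odd, and the second term is $\mu$ because the
term in parentheses is $P[Z \le \infty]$, which is the certain event for Z: N(0,1). Hence

$$E[X] = \mu \quad \text{for } X: N(\mu, \sigma^2).$$

Thus, the parameter $\mu$ in $N(\mu, \sigma^2)$ is indeed the expected or mean value of X as claimed
in Section 2.4.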


Example 4.1-3 
(expected value of Bernoulli RV) Assume that the RV B is Bernoulli distributed taking on
value 1 with probability p and 0 with probability q = 1 - p. Then the PMF is given as

$$P_B(k) = \begin{cases} p, & \text{when } k = 1, \\ q, & \text{when } k = 0, \\ 0, & \text{else.} \end{cases}$$

The expected value is then given as

$$E[B] = \sum_{k=-\infty}^{+\infty} k P_B(k) = 1 \cdot p + 0 \cdot q = p.$$
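Example 4.1-4
(expected value of binomial RV) Assume that the RV K is binomial distributed with PMF
$P_K(k) = b(k; n, p)$. Then we calculate the expected value as

$$
\begin{aligned}
E[K] &= \sum_{k=-\infty}^{+\infty} k P_K(k) \\
&= \sum_{k=0}^{n} k\, b(k; n, p) \\
&= \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k} \\
&= \sum_{k=1}^{n} k\, \frac{n!}{k!\,(n-k)!}\, p^k (1 - p)^{n-k} \\
&= \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!}\, p^k (1 - p)^{n-k} \\
&= np \left(\sum_{k'=0}^{n-1} \frac{(n-1)!}{k'!\,(n-1-k')!}\, p^{k'} (1 - p)^{n-1-k'}\right) \quad \text{with } k' \triangleq k - 1, \\
&= np, \quad \text{since the sum in the round brackets is } \sum_{k'=0}^{n-1} b(k'; n-1, p) = 1.
\end{aligned}
$$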
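The result E[K] = np can be confirmed by summing $k\, b(k; n, p)$ directly. The following is a small sketch in Python; the function names and the two test cases (n = 10, p = 0.3 and n = 50, p = 0.01) are choices made here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k (1 - p)^(n - k) for 0 <= k <= n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_mean(n, p):
    """Expected value computed from the definition, summing k * b(k; n, p)."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))


if __name__ == "__main__":
    for n, p in ((10, 0.3), (50, 0.01)):
        print(n, p, binomial_mean(n, p), n * p)
```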
Example 4.1-5 
(more on multiple lottery tickets) We continue Example 1.9-6 of Chapter 1 on whether 
it is better to buy 50 tickets from a single lottery or 1 ticket each from 50 successive 
lotteries, all independent and with the same fair odds. Here we are interested in the mean 
or expected return in each case. Again each lottery has 100 tickets at $1 each and the 
fair payoff is $100 to the winner. For the single lottery, we remember the odds of winning 
are 50 percent, so the expected payoff is $50. For the 50 plays in separate lotteries, we 
recall that the number of wins K is binomial distributed as b(k; 50,0.01), so the mean value 
E[K] = np = 50 × 0.01 = 0.5. Since the payoff would be $100K, the average payoff would
be $50, same as in the single lottery. 


Example 4.1-6 
(expected value of Poisson) Let K be a Poisson RV with parameter a > 0. Then

$$E[K] = \sum_{k=0}^{\infty} k\, \frac{a^k}{k!}\, e^{-a} = a e^{-a} \sum_{k=1}^{\infty} \frac{a^{k-1}}{(k-1)!} = a e^{-a} e^{a} = a. \tag{4.1-15}$$

Thus, the expected value of a Poisson RV is the parameter a.


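Linearity of expectation. When we regard mathematical expectation E as an operator, it
is relatively easy to see that it is a linear operator. For any X, consider

$$E\left[\sum_{i=1}^{N} g_i(X)\right] = \int_{-\infty}^{+\infty} \left(\sum_{i=1}^{N} g_i(x)\right) f_X(x)\, dx \tag{4.1-16}$$

$$= \sum_{i=1}^{N} \int_{-\infty}^{+\infty} g_i(x) f_X(x)\, dx \tag{4.1-17}$$

$$= \sum_{i=1}^{N} E[g_i(X)], \tag{4.1-18}$$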


provided that these exist. The expectation operator E is also linear for the sum of two RVs: 


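$$
\begin{aligned}
E[X + Y] &= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} (x + y) f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x f_{XY}(x, y)\, dx\, dy + \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{+\infty} x \left(\int_{-\infty}^{+\infty} f_{XY}(x, y)\, dy\right) dx + \int_{-\infty}^{+\infty} y \left(\int_{-\infty}^{+\infty} f_{XY}(x, y)\, dx\right) dy \\
&= \int_{-\infty}^{+\infty} x f_X(x)\, dx + \int_{-\infty}^{+\infty} y f_Y(y)\, dy \\
&= E[X] + E[Y].
\end{aligned}
$$

The reader will notice that this result can readily be extended to the sum of N RVs
$X_1, X_2, \ldots, X_N$. Thus,

$$E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i]. \tag{4.1-19}$$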


Note that independence is not required. We can summarize both linearity results by saying 
that the mathematical expectation operator E distributes over a sum of RVs. 
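That independence is not needed can be seen in a small exact computation. The sketch below is a Python illustration under assumptions chosen here, not taken from the text: X is the face of a fair die and $Y = X^2$, so Y is completely determined by X. Expectations are computed as exact fractions; the sum rule holds, whereas the product rule E[XY] = E[X]E[Y], which does rely on more than linearity, fails for this pair.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                 # assumed PMF: fair die

# Y = X^2 is a deterministic function of X, hence strongly dependent on it.
E_X = sum(x * p for x in faces)
E_Y = sum(x * x * p for x in faces)
E_sum = sum((x + x * x) * p for x in faces)
E_product = sum(x * (x * x) * p for x in faces)

if __name__ == "__main__":
    print("E[X] + E[Y] =", E_X + E_Y, " E[X + Y] =", E_sum)
    print("E[X]E[Y] =", E_X * E_Y, " E[XY] =", E_product)
```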


Example 4.1-7 
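(variance of Gaussian) Let $X: N(\mu, \sigma^2)$ and consider the zero-mean RV $X - \mu$ with variance
$E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - \mu^2 = \mathrm{Var}[X]$ by the linearity of expectation
E. We can write

$$
\begin{aligned}
E[(X - \mu)^2] &= \int_{-\infty}^{+\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} z^2 e^{-\frac{z^2}{2}}\, dz \quad \text{with substitution } z \triangleq (x - \mu)/\sigma.
\end{aligned}
$$

Next we integrate by parts with $u = z$ and $dv = z e^{-z^2/2}\, dz$, yielding $du = dz$ and $v = -e^{-z^2/2}$,
so that the above integral becomes

$$
\begin{aligned}
\int_{-\infty}^{+\infty} z^2 e^{-\frac{z^2}{2}}\, dz &= \left.\left(-z e^{-\frac{z^2}{2}}\right)\right|_{-\infty}^{+\infty} + \int_{-\infty}^{+\infty} e^{-\frac{z^2}{2}}\, dz \\
&= -0 + 0 + \sqrt{2\pi},
\end{aligned}
$$

where the last term is due to the fact that the standard Normal N(0, 1) density integrates to
1. Thus we have $E[(X - \mu)^2] = \frac{\sigma^2}{\sqrt{2\pi}} \sqrt{2\pi} = \sigma^2$, and thus the parameter $\sigma^2$ in the Gaussian
density is shown to be the variance of the RV $X - \mu$, which is the same as the variance of
the RV X.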


We have now established that the parameters introduced in Chapter 2, upon definition 
of the Gaussian density, are actually the mean and variance of this distribution. In practice 
these basic parameters are often estimated by making many independent observations on 
X and using Equation 4.1-1 to estimate the mean and Equation 4.1-2 to estimate $\sigma$.


Example 4.1-8 
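(mean of Cauchy) The Cauchy pdf with parameters $\alpha$ ($-\infty < \alpha < \infty$) and $\beta$ ($\beta > 0$) is
given by

$$f_X(x) = \frac{1}{\pi\beta\left[1 + \left(\frac{x - \alpha}{\beta}\right)^2\right]}, \quad -\infty < x < \infty. \tag{4.1-20}$$

Let X be Cauchy with $\alpha = 0$ and $\beta = 1$. Then

$$E[X] = \int_{-\infty}^{\infty} x\, \frac{1}{\pi(x^2 + 1)}\, dx$$

is an improper integral and doesn’t converge in the ordinary sense. However, if we evaluate
the integral in the Cauchy principal value sense, that is,

$$E[X] = \lim_{x_0 \to \infty} \left[\int_{-x_0}^{x_0} x \left(\frac{1}{\pi(x^2 + 1)}\right) dx\right], \tag{4.1-21}$$

then E[X] = 0. Note, however, that with $Y \triangleq X^2$, E[Y] doesn’t exist in any sense because

$$E[Y] = \int_{-\infty}^{\infty} x^2\, \frac{1}{\pi(x^2 + 1)}\, dx = \infty \tag{4.1-22}$$

and thus fails to converge in any sense. Thus, the variance of a Cauchy RV is infinite.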


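Expected value of a function of RVs. For a function of two RVs, that is, Z = g(X,Y),
the expected value of Z can be computed from

$$
\begin{aligned}
E[Z] &= \int_{-\infty}^{\infty} z f_Z(z)\, dz \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\, dx\, dy.
\end{aligned} \tag{4.1-23}
$$

To prove that Equation 4.1-23 can be used to compute E[Z] requires an argument similar
to the one we used in establishing Equation 4.1-9. Indeed one would start with an equation
very similar to Equation 4.1-10, for example,

$$\{z_j < Z \le z_j + \Delta z_j\} = \bigcup_{k=1}^{N_j} \{(X, Y) \in D_k\},$$

where the $D_k$ are very small disjoint regions containing the points $(x_j^{(k)}, y_j^{(k)})$ such that
$g(x_j^{(k)}, y_j^{(k)}) = z_j$. Taking probabilities of both sides and recalling that the $D_k$ are disjoint
yields

$$f_Z(z_j)\, \Delta z_j \approx \sum_{k=1}^{N_j} f_{XY}(x_j^{(k)}, y_j^{(k)})\, \Delta a_j^{(k)},$$

where $\Delta a_j^{(k)}$ is an infinitesimal area.
Now multiply both sides by $z_j$ and recall that $z_j = g(x_j^{(k)}, y_j^{(k)})$. Then

$$z_j f_Z(z_j)\, \Delta z_j \approx \sum_{k=1}^{N_j} g(x_j^{(k)}, y_j^{(k)})\, f_{XY}(x_j^{(k)}, y_j^{(k)})\, \Delta a_j^{(k)}$$

and, summing over j, as $j \to \infty$, $\Delta z_j \to 0$, $\Delta a_j^{(k)} \to da = dx\, dy$,

$$\int_{-\infty}^{\infty} z f_Z(z)\, dz = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\, dx\, dy.$$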
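An alternative proof is of interest.† As before let Z = g(X,Y) and write

$$
\begin{aligned}
E[Z] &= \int_{-\infty}^{\infty} z f_Z(z)\, dz \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z f_{Z|Y}(z|y) f_Y(y)\, dy\, dz.
\end{aligned}
$$

The second line follows from the definition of a marginal pdf. Now recall that if Z = g(X)
then

$$\int_{-\infty}^{\infty} z f_Z(z)\, dz = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx.$$

We can use this result in the present problem as follows. If we hold Y fixed at Y = y, then
g(X,y) depends only on X, and the conditional expectation of Z with Y = y is

$$\int_{-\infty}^{\infty} z f_{Z|Y}(z|y)\, dz = \int_{-\infty}^{\infty} g(x, y) f_{X|Y}(x|y)\, dx.$$

Using this result in the above yields

$$
\begin{aligned}
E[Z] &= \int_{-\infty}^{\infty} z f_Z(z)\, dz \\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} z f_{Z|Y}(z|y)\, dz\right) f_Y(y)\, dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{X|Y}(x|y) f_Y(y)\, dx\, dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\, dx\, dy.
\end{aligned}
$$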


Example 4.1-9 
(mean of product of independent RVs) Let g(x,y) = xy. Compute E[Z] if Z = g(X, Y) with
X and Y independent and Normal with pdf

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2} \exp\left[-\frac{1}{2\sigma^2}\left((x - \mu_1)^2 + (y - \mu_2)^2\right)\right].$$

Solution Direct substitution into Equation 4.1-23 and recognizing that the resulting
double integral factors into the product of two single integrals enables us to write

$$
\begin{aligned}
E[Z] &= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x \exp\left[-\frac{1}{2\sigma^2}(x - \mu_1)^2\right] dx \\
&\quad \times \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} y \exp\left[-\frac{1}{2\sigma^2}(y - \mu_2)^2\right] dy \\
&= \mu_1 \mu_2.
\end{aligned}
$$

†Carl W. Helstrom, Probability and Stochastic Processes for Engineers, 2nd edition. New York,
Macmillan, 1991.


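Equation 4.1-23 can be used to compute E[X] or E[Y]. Thus with Z = g(X,Y) = X, we
obtain

$$
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} x \left[\int_{-\infty}^{\infty} f_{XY}(x, y)\, dy\right] dx.
\end{aligned} \tag{4.1-24}
$$

By Equation 2.6-47, the integral in brackets is the marginal pdf $f_X(x)$. Hence
Equation 4.1-23 is completely consistent with the definition

$$E[X] \triangleq \int_{-\infty}^{\infty} x f_X(x)\, dx.$$

With the help of marginal densities we can conclude that

$$
\begin{aligned}
E[X + Y] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x + y) f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} x \left(\int_{-\infty}^{\infty} f_{XY}(x, y)\, dy\right) dx + \int_{-\infty}^{\infty} y \left(\int_{-\infty}^{\infty} f_{XY}(x, y)\, dx\right) dy \\
&= E[X] + E[Y].
\end{aligned} \tag{4.1-25}
$$

Equation 4.1-25 can be extended to N random variables $X_1, X_2, \ldots, X_N$. Thus

$$E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i]. \tag{4.1-26}$$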

Note that independence is not required. 


Example 4.1-10 
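(independent Normal RVs) Let X, Y be jointly normal, independent RVs with pdf

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left(-\frac{1}{2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - \mu_2}{\sigma_2}\right)^2\right]\right).$$

It is clear that X and Y are independent since $f_{XY}(x, y) = f_X(x) f_Y(y)$. The marginal pdf’s
are obtained using Equations 2.6-44 and 2.6-47:

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_1}{\sigma_1}\right)^2\right]$$

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{1}{2}\left(\frac{y - \mu_2}{\sigma_2}\right)^2\right].$$

Thus Equation 4.1-25 yields

$$E[X + Y] = \mu_1 + \mu_2.$$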


Example 4.1-11 
(Chi-square law) In a number of problems in engineering and science, signals add inco- 
herently, meaning that the power in the sum of the signals is merely the sum of the 
powers. This occurs, for example, in optics when a surface is illuminated by light sources 
of different wavelengths. Then the power measured on the surface is just the sum of 
the powers contributed by each of the sources. In electric circuits, when the sources are 
sinusoidal at different frequencies, the power dissipated in any resistor is the sum of the 
powers contributed by each of the sources. Suppose the individual source signals, at a 
given instant of time, are modeled as identically distributed Normal RVs. In particular 
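let $X_1, X_2, \ldots, X_n$ represent the n independent signals produced by the n sources with
$X_i: N(0,1)$ for $i = 1, 2, \ldots, n$ and let $Y_i = X_i^2$. We know from Example 3.2-2 in Chapter 3
that the pdf of $Y_i$ is given by

$$f_{Y_i}(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}\, u(y).$$

Consider now the sums $Z_2 = Y_1 + Y_2$, $Z_3 = Y_1 + Y_2 + Y_3$, ..., $Z_n = \sum_{i=1}^{n} Y_i$. The pdf of $Z_2$
is easily computed by convolution as

$$
\begin{aligned}
f_{Z_2}(z) &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi x}}\, e^{-x/2}\, u(x)\, \frac{1}{\sqrt{2\pi(z - x)}}\, e^{-(z-x)/2}\, u(z - x)\, dx \\
&= \frac{1}{\pi}\, e^{-z/2} \int_{0}^{\sqrt{z}} \frac{dy}{\sqrt{z - y^2}}\; u(z) \\
&= \frac{1}{2}\, e^{-z/2}\, u(z) \quad \text{(exponential pdf).}
\end{aligned}
$$

To get from line 1 to line 2 we let $x = y^2$. To get from line 2 to line 3, we used that the
integral is an elementary trigonometric function integral in disguise. To get the pdf of $Z_3$
we convolve the pdf of $Z_2$ with that of $Y_3$. The result is

$$
\begin{aligned}
f_{Z_3}(z) &= \int_{-\infty}^{\infty} \frac{1}{2}\, e^{-x/2}\, u(x) \times \frac{1}{\sqrt{2\pi(z - x)}}\, e^{-\frac{1}{2}(z - x)}\, u(z - x)\, dx \\
&= \frac{1}{\sqrt{2\pi}}\, z^{1/2}\, e^{-z/2}\, u(z).
\end{aligned}
$$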
Figure 4.1-2 The Chi-square pdf (pdf value versus argument value) for three large values of
the parameter n: n = 30 (solid); n = 40 (dashed); n = 50 (stars). For large values of n, the
Chi-square pdf can be approximated by a normal N(n, 2n) for computing probabilities not too
far from the mean. For example, for n = 30, $P[\mu - \sigma < X < \mu + \sigma] = 0.6827$ assuming
X: N(30, 60). The value computed, using single-precision arithmetic, using the Chi-square pdf,
yields 0.6892.


We leave the intermediate steps which involve only elementary transformations to the reader. 
Proceeding in this way, or using mathematical induction, we find that 


$$f_{Z_n}(z) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, z^{(n/2)-1}\, e^{-z/2}\, u(z).$$


This pdf was introduced in Chapter 2 as the Chi-square pdf. More precisely, it is known as 
the Chi-square distribution with n degrees-of-freedom. For n > 2, the pdf has value zero at 
z = 0, reaches a peak, and then exhibits monotonically decreasing tails. For large values of 
n, it resembles a Gaussian pdf with mean in the vicinity of n. However, the Chi-square can 
never be truly Gaussian because the Chi-square RV never takes on negative values. The 
character of the Chi-square pdf is shown in Figure 4.1-2 for different values of large n. 

The mean and variance of the Chi-square RV are readily computed from the definition
$Z_n = \sum_{i=1}^{n} X_i^2$. Thus $E[Z_n] = E\left[\sum_{i=1}^{n} X_i^2\right] = \sum_{i=1}^{n} E[X_i^2] = n$. Also $\mathrm{Var}(Z_n) = E[(Z_n - n)^2]$.
After simplifying, we obtain $\mathrm{Var}(Z_n) = E[Z_n^2] - n^2$. We leave it to the reader to show that
$E[Z_n^2] = 2n + n^2$ and, hence, that $\mathrm{Var}(Z_n) = 2n$.
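The stated moments can be checked by numerically integrating the Chi-square pdf. The sketch below is a Python illustration; the choice n = 4, the integration range (0, 200), and the function names are assumptions made here. It evaluates the first two moments with a midpoint Riemann sum and compares them with n and $2n + n^2$.

```python
import math


def chi_square_pdf(z, n):
    """Chi-square pdf with n degrees of freedom; zero for z <= 0."""
    if z <= 0:
        return 0.0
    return z ** (n / 2 - 1) * math.exp(-z / 2) / (2 ** (n / 2) * math.gamma(n / 2))


def chi_square_moment(k, n, upper=200.0, steps=200_000):
    """Approximate E[Z^k] by a midpoint Riemann sum over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        total += z ** k * chi_square_pdf(z, n)
    return h * total


n = 4                                   # illustrative degrees of freedom
mean = chi_square_moment(1, n)
second_moment = chi_square_moment(2, n)
variance = second_moment - mean ** 2

if __name__ == "__main__":
    print("mean:", mean, "second moment:", second_moment, "variance:", variance)
```

The value n = 4 is used because the pdf is then bounded at the origin, which keeps the simple midpoint rule accurate.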


Example 4.1-12 
At the famous University of Politicalcorrectness (U of P), the administration requires that 
each professor be equipped with an electronic Rolodex which contains the names of every 
student in the class. When the professor wishes to call on a student, she merely hits the 
“call” button on the Rolodex, and a student’s name is selected randomly by an electronic 
circuit inside the Rolodex. By using this device the professor becomes immune to charges 




of bias in the selection of students she calls on to answer her questions. Find an expression 
for the average number of “calls” r required so each student is called upon at least once. 


Solution The use of the electronic Rolodex implies that some students may not be called 
at all during the entire semester and other students may be called twice or three times 
in a row. It will depend on how big the class is. Nevertheless the average is well defined 
because extremely long bad runs, that is, where one or more students are not called on, 
are very rare. The careful reader may have observed that this is an occupancy problem if 
we associate “calls” with balls and students with cells. Let $R \in \{n, n+1, n+2, \ldots\}$ denote
the number of balls needed to fill all the n cells for the first time. The only way that this can
happen is that the first R-1 balls fill all but one of the n cells (event $E_1$) and the Rth ball
fills the remaining empty cell (event $E_2$). Translated to the class situation, this means that
after R-1 calls, all but one student will have been called (event $E_1$) and this student will
be called on the Rth call (event $E_2$). Thus $P[R = r; n] \triangleq P_R(r, n) = P[E_1 E_2] = P[E_1]P[E_2]$
since $E_1$ and $E_2$ are independent. Now $P[E_2] = 1/n$ since it is merely the probability that
a given ball goes into a selected cell, and $P[E_1]$ is $P_1(r-1, n)$ of Equation 1.8-13, that is

$$P_1(r - 1, n) = \binom{n}{1} \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}, \quad r \ge n$$

$$= 0, \quad \text{else.}$$

Thus $P_R(r, n)$ is given by

$$P_R(r, n) = \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}, \quad r \ge n \tag{4.1-27}$$

$$= 0, \quad \text{else.}$$


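The probability $P_K(k, n)$ that all n cells (students) have been filled (called) after distributing
k balls (called k students) is, from Equation 1.8-9,

$$P_K(k, n) = \sum_{i=0}^{n} \binom{n}{i} (-1)^i \left(1 - \frac{i}{n}\right)^{k}, \quad k \ge n \tag{4.1-28}$$

$$= 0, \quad \text{else.}$$

Finally, the expected value of the RV R is given by

$$E[R] = \sum_{r=n}^{\infty} r \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}. \tag{4.1-29}$$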
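Equation 4.1-29 can be evaluated by truncating the outer sum. The sketch below is a Python illustration; the function names and the truncation point `r_max` are choices made here. As an independent check that is not part of the text, the result is compared with the well-known coupon-collector value $n(1 + 1/2 + \cdots + 1/n)$, which for a class of n = 20 is roughly 72 calls.

```python
from math import comb


def prob_first_complete(r, n):
    """P_R(r, n) of Equation 4.1-27: all n students first covered on call r."""
    if r < n:
        return 0.0
    return sum(
        comb(n - 1, i) * (-1) ** i * (1 - (i + 1) / n) ** (r - 1)
        for i in range(n)
    )


def expected_calls(n, r_max=5000):
    """Equation 4.1-29 with the sum over r truncated at r_max (an assumption)."""
    return sum(r * prob_first_complete(r, n) for r in range(n, r_max + 1))


def coupon_collector_mean(n):
    """Independent check: n times the nth harmonic number."""
    return n * sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (2, 5, 20):
        print(n, expected_calls(n), coupon_collector_mean(n))
```

The truncation is harmless here because the tail of $P_R(r, n)$ decays geometrically in r.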


Example 4.1-13 
Write a MATLAB program for computing the probability that all the students in Example 
4.1-12 are called upon at least once in r calls from the electronic Rolodex. Assume there 
are 20 students in the class. 


Solution The appropriate equation to be coded is Equation 4.1-28. The result is shown 
in Figure 4.1-3. 
Figure 4.1-3 MATLAB result for Example 4.1-13: probability of all 20 students in the class
being called (vertical axis, 0 to 1) versus the number of tries r (horizontal axis, 0 to 200).


function [tries,prob]=occupancy(balls,cells)
tries=1:balls;          % identifies a vector 'tries'
prob=zeros(1,balls);    % identifies a vector 'prob'
a=zeros(1,cells);       % identifies a vector 'a'
d=zeros(1,cells);       % identifies a vector 'd'
term=zeros(1,cells);    % identifies a vector 'term'
% next follows the realization of Equation (4.1-28)
for m=1:balls
    for k=1:cells
        % signed binomial coefficient (-1)^k * C(cells,k)
        a(k)=((-1)^k)*prod(1:cells)/(prod(1:k)*prod(1:cells-k));
        d(k)=(1-(k/cells))^m;
        term(k)=a(k)*d(k);
    end
    prob(m)=1+sum(term);   % the leading 1 is the k=0 term of the sum
end
plot(tries,prob)
title(['Probability of all ' num2str(cells) ' students in the class being called in r tries'])
xlabel('number of tries')
ylabel(['Probability of all ' num2str(cells) ' students being called'])
Example 4.1-14 
Write a MATLAB program for computing the average number of calls required for each 


student to be called at least once. Assume a maximum of 50 students and make sure the 
number of calls is large (n > 400). 
Figure 4.1-4 MATLAB result for Example 4.1-14: expected number of Rolodex tries needed to
call all students at least once (vertical axis, 0 to 250) versus the number of students in the
class (horizontal axis, 0 to 50).


Solution The appropriate equation to be coded is Equation 4.1-29. The result is shown 
in Figure 4.1-4. 


function [cellvec,avevec]=avertries(ballimit,cellimit)
cellvec = 1:cellimit;
avevec = zeros(1,cellimit);
brterm = zeros(1,ballimit);
lrterm = zeros(1,ballimit);
% realization of Equation (4.1-29), with the sum over r truncated at ballimit
for n=1:cellimit
    a = zeros(1,n);
    d = zeros(1,n);
    termvec = zeros(1,n);
    for r=1:ballimit
        for i=1:n-1
            % signed binomial coefficient (-1)^i * C(n-1,i)
            a(i) = ((-1)^i)*prod(1:n-1)/(prod(1:i)*prod(1:n-1-i));
            d(i) = (1-((i+1)/n))^(r-1);
            termvec(i) = a(i)*d(i);
        end
        brterm(r)=r*sum(termvec);          % terms i = 1,...,n-1
        lrterm(r)=r*((1-(1/n)))^(r-1);     % term i = 0
    end
    avevec(n)=sum(brterm)+sum(lrterm);
end
plot(cellvec,avevec,'o')
title('Average number of Rolodex tries to call all students at least once')
xlabel('number of students in the class')
ylabel('Expected number of Rolodex tries to reach all students at least once')
grid

Example 4.1-15 
(geometric distribution) The RV X is said to have a geometric distribution if its probability 
mass function is given by 


Px(n) = (1 — a)a"u(n), 


where u is the unit-step function† and 0 < a < 1. Clearly $\sum_{n=0}^{\infty} P_X(n) = 1$, a result easily
obtained from $\sum_{n=0}^{\infty} a^n = (1 - a)^{-1}$ for 0 < a < 1. The expected value is found from

$$E[X] \triangleq \mu = (1 - a)\sum_{n=0}^{\infty} n a^n = (1 - a) \times a \times \frac{d}{da}\left\{(1 - a)^{-1}\right\} = \frac{a}{1 - a}.$$

Solving for a, we obtain

$$a = \frac{\mu}{1 + \mu}.$$

Thus, we can rewrite the geometric PMF as

$$P_X(n) = \frac{1}{1 + \mu}\left(\frac{\mu}{1 + \mu}\right)^n u(n).$$

Note: There is another common definition of a geometric RV where the PMF support is
$[1, \infty)$ instead of $[0, \infty)$. The corresponding geometric law appeared earlier in Example 1.9-4.
Its PMF would take the form $P_X(n) = (1 - a)a^{n-1}u(n - 1)$, that is, the same sequence of
numbers shifted right one place.


4.2 CONDITIONAL EXPECTATIONS 


In many practical situations we want to know the average of a subset of the population: 
the average of the passing grades of an exam; the average lifespan of people who are still 
alive at age 70; the average height of fighter pilots (many air forces have both an upper and 
lower limit on the acceptable height of a pilot); the average blood pressure of long-distance 
runners, and so forth. Problems of this type fall within the realm of conditional expectations. 

In conditional expectations we compute the average of a subset of a population that 
shares some property due to the outcome of an event. For example in the case of the average 
of passing grades, the subset is those exams that received passing grades. What all these 
exams share is that their grade is, say, ≥ 65. The event that has occurred is that they 
received passing grades. 


Definition 4.2-1 The conditional expectation of X given that the event B has 
occurred is 

$$E[X|B] \triangleq \int_{-\infty}^{\infty} x f_{X|B}(x|B)\,dx. \tag{4.2-1}$$

†That is, u(n) = 1 for n ≥ 0 and u(n) = 0, else. 

If X is discrete, then Equation 4.2-1 can be replaced with 

$$E[X|B] \triangleq \sum_i x_i P_{X|B}(x_i|B). \tag{4.2-2}$$


To give the reader a feel for the notion of conditional expectation, consider the following 
exam scores in a course on probability theory: 28, 35, 44, 66, 68, 75, 77, 80, 85, 87, 90, 
100, 100. Assume that the passing grade is 65. Then the average score is 71.9; however, the 
average passing score is 82.8. A closely related example is worked out as follows. 
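
The two averages quoted for the exam scores can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen here for illustration, and "passing" is taken to mean a score of at least 65.

```python
scores = [28, 35, 44, 66, 68, 75, 77, 80, 85, 87, 90, 100, 100]
PASSING_GRADE = 65

def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Conditioning on the event B = {score >= 65} keeps only the passing exams.
passing_scores = [s for s in scores if s >= PASSING_GRADE]

overall_average = mean(scores)          # 935 / 13
passing_average = mean(passing_scores)  # 828 / 10
```

The conditional average is taken over fewer exams (10 instead of 13), which is the discrete counterpart of dividing by P[B] in Equation 4.2-2.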


Example 4.2-1 
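(conditional expectation of uniform distribution) Consider a continuous RV X and the event 
$B \triangleq \{X \ge a\}$. From Equations 2.6-1 and 2.6-2 and a little bit of work, we obtain 

$$F_{X|B}(x|X \ge a) = \begin{cases} 0, & x < a, \\ \dfrac{F_X(x) - F_X(a)}{1 - F_X(a)}, & x \ge a. \end{cases} \tag{4.2-3}$$

Hence 

$$f_{X|B}(x|X \ge a) = \begin{cases} 0, & x < a, \\ \dfrac{f_X(x)}{1 - F_X(a)}, & x \ge a, \end{cases} \tag{4.2-4}$$

and 

$$E[X|X \ge a] = \frac{\int_a^{\infty} x f_X(x)\,dx}{1 - F_X(a)}. \tag{4.2-5}$$

Assume that X is a uniform RV in [0, 100]. Then 

$$E[X] = \frac{1}{100}\int_0^{100} x\,dx = 50,$$

but using Equation 4.2-5 with a = 65 

$$E[X|X \ge 65] = 82.5.$$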
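
Equation 4.2-5 is easy to evaluate numerically for any pdf. The sketch below uses a simple midpoint rule; the helper name `conditional_mean_above`, the truncation point `upper`, and the exponential pdf with mean 10 are assumptions introduced here for illustration, not part of the example. For the exponential case the result should be a + μ, a consequence of the memoryless property.

```python
import math

def conditional_mean_above(pdf, a, upper, steps=100_000):
    """Midpoint-rule estimate of E[X | X >= a] from Equation 4.2-5.

    `upper` truncates the integration range and must cover essentially
    all of the probability mass above a.
    """
    h = (upper - a) / steps
    num = 0.0  # approximates the integral of x * f(x) over [a, upper]
    den = 0.0  # approximates the integral of f(x) over [a, upper], i.e. P[X >= a]
    for k in range(steps):
        x = a + (k + 0.5) * h
        fx = pdf(x)
        num += x * fx * h
        den += fx * h
    return num / den

def uniform_pdf(x):
    """Uniform density on [0, 100], as in the example."""
    return 0.01 if 0.0 <= x <= 100.0 else 0.0

def exponential_pdf(x, mean=10.0):
    """Exponential density with the given mean (illustrative second case)."""
    return math.exp(-x / mean) / mean if x >= 0.0 else 0.0

uniform_result = conditional_mean_above(uniform_pdf, 65.0, 100.0)
exponential_result = conditional_mean_above(exponential_pdf, 5.0, 405.0)
```

Here `uniform_result` reproduces the value 82.5 of the example, while `exponential_result` should be close to 5 + 10 = 15.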


Conditional expectations often occur when dealing with RVs that are related in some way. 
For example let Y denote the lifetime of a person chosen at random, and let X be a binary 
RV that denotes whether the person smokes or not, that is, X = 0 if a nonsmoker, X = 1 if a 
smoker. Then clearly E[Y|X = 0] is expected to be larger† than E[Y|X = 1]. Or let X be the 
intensity of the incident illumination and let Y be the instantaneous photocurrent generated 
by a photodetector. Typically the expected value of Y will be larger for stronger illumination 
and smaller for weaker illumination. We define some important concepts as follows. 

†Statistical evidence indicates that each cigarette smoked reduces longevity by about eight minutes. 
Hence smoking one pack a day for a whole year reduces the expected longevity of the smoker by about 40 days! 


Definition 4.2-2 Let X and Y be discrete RVs with joint PMF $P_{X,Y}(x_i, y_j)$. Then 
the conditional expectation of Y given $X = x_i$, denoted by $E[Y|X = x_i]$, is 

$$E[Y|X = x_i] \triangleq \sum_j y_j P_{Y|X}(y_j|x_i). \tag{4.2-6}$$

Here $P_{Y|X}(y_j|x_i)$ is the conditional probability that $\{Y = y_j\}$ occurs given that $\{X = x_i\}$ 
has occurred and is given by $P_{X,Y}(x_i, y_j)/P_X(x_i)$. ∎


We can derive an interesting and useful formula for E[Y] in terms of the conditional 
expectation of Y given X = x. The reasoning is much the same as that which we used in 
computing the average or total probability of an event in terms of its conditional 
probabilities (see Equation 1.6-7 or 2.6-4). Thus, 

\begin{align}
E[Y] &= \sum_j y_j P_Y(y_j) \tag{4.2-7} \\
&= \sum_j y_j \sum_i P_{X,Y}(x_i, y_j) \notag \\
&= \sum_i \Big[\sum_j y_j P_{Y|X}(y_j|x_i)\Big] P_X(x_i) \notag \\
&= \sum_i E[Y|X = x_i]\,P_X(x_i). \tag{4.2-8}
\end{align}


Equation 4.2-8 is a very neat result and says that we can compute E[Y] by averaging the 
conditional expectation of Y given X with respect to X.† Thus, in the smoking-longevity 
example discussed earlier, suppose E[Y|X = 0] = 79.2 years and E[Y|X = 1] = 69.4 years 
and $P_X(0) = 0.75$ and $P_X(1) = 0.25$. Then 

$$E[Y] = 79.2 \times 0.75 + 69.4 \times 0.25 = 76.75$$

is the expected lifetime of the general population. 
A result similar to Equation 4.2-8 holds for the continuous case as well. It is derived 
using Equation 2.6-85 from Chapter 2, that is, 

$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}, \qquad f_X(x) \ne 0. \tag{4.2-9}$$

The definition of conditional expectation for a continuous RV follows. 

†Notice that this statement implies that the conditional expectation of Y given X is an RV. We shall 
elaborate on this important concept shortly. For the moment we assume that X assumes the fixed value $x_i$ 
(or x). 

Definition 4.2-3 Let X and Y be continuous RVs with joint pdf $f_{XY}(x, y)$. Let the 
conditional pdf of Y given that X = x be denoted as in Equation 4.2-9. Then the conditional 
expectation of Y given that X = x is given by 

$$E[Y|X = x] \triangleq \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy. \tag{4.2-10}$$

Since 

$$E[Y] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f_{XY}(x, y)\,dx\,dy, \tag{4.2-11}$$

it follows from Equations 4.2-9 and 4.2-10 that 

\begin{align}
E[Y] &= \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy\right]dx \notag \\
&= \int_{-\infty}^{\infty} E[Y|X = x]\,f_X(x)\,dx. \tag{4.2-12}
\end{align}

Equation 4.2-12 is the continuous RV equivalent of Equation 4.2-8. It can be used to good 
advantage (over the direct method) for computing E[Y]. We illustrate this point with an 
example from optical communications. 


Example 4.2-2 
(conditional Poisson) In the photoelectric detector shown in Figure 4.2-1, the number of 
photoelectrons Y produced in time τ depends on the (normalized) incident energy X. If X 
were constant, say X = x, Y would be a Poisson RV [4-4] with parameter x, but as real light 
sources (except for gain-stabilized lasers) do not emit constant energy signals, X must be 
treated as an RV. In certain situations the pdf of X is accurately modeled by 

$$f_X(x) = \frac{1}{\mu_X}\exp\left(-\frac{x}{\mu_X}\right), \qquad x \ge 0, \tag{4.2-13}$$

where $\mu_X$ is a parameter that equals E[X]. We shall now compute E[Y] using Equation 4.2-12 
and using the direct method. 

[Figure 4.2-1: In a photoelectric detector, incident illumination generates a current consisting of photo-generated electrons. The diagram shows incident light falling on the photodetector and the output current i(t) versus t, made up of current pulses due to single photoelectrons.]


Solution Since for X = x, Y is Poisson, we can write 

$$P[Y = k|X = x] = \frac{x^k}{k!}e^{-x}, \qquad k = 0, 1, 2, \ldots$$

and, from Example 4.1-6, 

$$E[Y|X = x] = x.$$

Finally, using Equation 4.2-12 with the appropriate substitutions, that is, 

$$E[Y] = \int_0^{\infty} x\left[\frac{1}{\mu_X}\exp\left(-\frac{x}{\mu_X}\right)\right]dx,$$

we obtain, by integration by parts, 

$$E[Y] = \mu_X.$$

In contrast to the simplicity with which we obtained this result, consider the direct approach, 
that is, 

$$E[Y] = \sum_{k=0}^{\infty} k P_Y(k). \tag{4.2-14}$$

To compute $P_Y(k)$ we use the Poisson transform (Equation 2.6-14) with $f_X(x)$, as given by 
Equation 4.2-13. This furnishes (see Equation 2.6-23) 

$$P_Y(k) = \frac{\mu_X^k}{(1 + \mu_X)^{k+1}}. \tag{4.2-15}$$

Finally, using Equation 4.2-15 in 4.2-14 yields 

$$E[Y] = \sum_{k=0}^{\infty} k\,\frac{\mu_X^k}{(1 + \mu_X)^{k+1}}.$$

It is known that this series sums to $\mu_X$. Alternatively one can evaluate the sum indirectly 
using some clever tricks involving derivatives. 
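
A quick numerical check of the direct approach: the sketch below sums $k P_Y(k)$ with $P_Y(k) = \mu_X^k/(1 + \mu_X)^{k+1}$ and compares the total with $\mu_X$. The function name and the truncation limit `k_max` are assumptions made for this illustration. Analytically, writing $a = \mu_X/(1 + \mu_X)$ and using $\sum_k k a^k = a/(1 - a)^2$ (obtained by differentiating the geometric series) gives the same value.

```python
def mean_photoelectrons(mu_x, k_max=5000):
    """Truncated sum of k * P_Y(k), P_Y(k) = mu_x**k / (1 + mu_x)**(k + 1)."""
    ratio = mu_x / (1.0 + mu_x)   # P_Y(k + 1) / P_Y(k)
    term = 1.0 / (1.0 + mu_x)     # P_Y(0)
    total = 0.0
    for k in range(k_max + 1):
        total += k * term
        term *= ratio             # advance to P_Y(k + 1)
    return total
```

For example, `mean_photoelectrons(3.0)` should return a value very close to 3, in agreement with the result obtained from Equation 4.2-12.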
Example 4.2-3 
(conditional Gaussian density) Let X and Y be two zero-mean RVs with joint density 

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}}\exp\left(-\frac{x^2 + y^2 - 2\rho xy}{2\sigma^2(1 - \rho^2)}\right), \qquad |\rho| < 1. \tag{4.2-16}$$

We shall soon find out (Section 4.3) that the pdf in Equation 4.2-16 is a special case of 
the general joint Gaussian law for two RVs. First we see that when ρ ≠ 0, $f_{XY}(x, y) \ne f_X(x) f_Y(y)$; 
hence X and Y are not independent when ρ ≠ 0. When ρ = 0, we can indeed 
write $f_{XY}(x, y) = f_X(x) f_Y(y)$ so that ρ = 0 implies independence. For the present, however, 
our unfamiliarity with the meaning of ρ (ρ is called the normalized covariance or correlation 
coefficient) is not important. Whatever the value of ρ, X and Y are zero-mean Gaussian RVs, that is, 

$$f_X(x) = f_Y(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/(2\sigma^2)}.$$

However, when ρ ≠ 0 the conditional expectation of Y given X = x is not zero even though Y is a 
zero-mean RV! In fact from Equation 4.2-9, 

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2(1 - \rho^2)}}\exp\left(-\frac{(y - \rho x)^2}{2\sigma^2(1 - \rho^2)}\right). \tag{4.2-17}$$

Hence $f_{Y|X}(y|x)$ is Gaussian with mean ρx. Thus, 

\begin{align}
E[Y|X = x] &= \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy \notag \\
&= \rho x. \tag{4.2-18}
\end{align}

When ρ is close to unity, E[Y|X = x] ≈ x, which implies that Y tracks X quite closely 
(exactly if ρ = 1), and if we wish to predict Y, say, with $Y_p$ upon observing X = x, 
a good bet is to choose our predicted value $Y_p = x$. On the other hand, when ρ = 0, 
observing X doesn’t help us to predict Y. Thus, we see that in the Gaussian case at least 
and somewhat more generally, ρ is related to the predictability of one RV from observing 
another. A cautionary note should be sounded, however: The fact that one RV doesn’t help 
us to linearly predict another doesn’t generally mean that the two RVs are independent. 
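
The relation E[Y|X = x] = ρx can be checked numerically straight from the joint density, without using the closed form of the conditional pdf. The sketch below is an illustration only: the function names, the integration window of ±12σ, and the parameter values in the usage note are assumptions, not part of the example.

```python
import math

def joint_pdf(x, y, sigma, rho):
    """Zero-mean joint Gaussian density of Equation 4.2-16."""
    norm = 2.0 * math.pi * sigma ** 2 * math.sqrt(1.0 - rho ** 2)
    q = (x * x + y * y - 2.0 * rho * x * y) / (2.0 * sigma ** 2 * (1.0 - rho ** 2))
    return math.exp(-q) / norm

def conditional_mean(x, sigma, rho, half_width=12.0, steps=20_000):
    """Midpoint-rule estimate of E[Y | X = x].

    Computes (integral of y f(x, y) dy) / (integral of f(x, y) dy); the
    denominator is the marginal f_X(x), as in Equation 4.2-9.
    """
    lo = -half_width * sigma
    h = 2.0 * half_width * sigma / steps
    num = 0.0
    den = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        f = joint_pdf(x, y, sigma, rho)
        num += y * f
        den += f
    return num / den
```

With σ = 2 and ρ = 0.8, for instance, `conditional_mean(1.5, 2.0, 0.8)` should be close to 0.8 × 1.5 = 1.2, and with ρ = 0 the estimate should be essentially zero for every x.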


Example 4.2-4 
(expectation conditioned on sums of RVs) Consider the two independent, discrete RVs $K_1$ 
and $K_2$. We wish to compute $E[K_1|K_1 + K_2 = m]$. It is first necessary to determine the 
conditional probability $P[K_1 = k_1|K_1 + K_2 = m]$. This conditional probability can be 
written as 

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{P[K_1 = k_1, K_1 + K_2 = m]}{P[K_1 + K_2 = m]} \notag \\
&= \frac{P[K_1 = k_1, K_2 = m - k_1]}{P[K_1 + K_2 = m]} \tag{4.2-19} \\
&= \frac{P[K_1 = k_1]\,P[K_2 = m - k_1]}{P[K_1 + K_2 = m]}. \notag
\end{align}

Let $K_1$ and $K_2$ be distributed as Poisson with parameters $\theta_1$ and $\theta_2$, respectively. (If 
$\theta_1 = \theta_2 = \theta$, these RVs are independent and identically distributed, and we designate them as i.i.d. RVs.) 
Then, from Equation 4.2-19 and Example 3.3-8 we get 

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{\left(e^{-\theta_1}\theta_1^{k_1}/k_1!\right) \times \left(e^{-\theta_2}\theta_2^{m - k_1}/(m - k_1)!\right)}{e^{-(\theta_1 + \theta_2)}(\theta_1 + \theta_2)^m/m!} \notag \\
&= \binom{m}{k_1}\theta_1^{k_1}\theta_2^{m - k_1} \times (\theta_1 + \theta_2)^{-m}. \tag{4.2-20}
\end{align}

Now recall that $E[K_1|K_1 + K_2 = m] \triangleq \sum_{k_1=0}^{m} k_1 P[K_1 = k_1|K_1 + K_2 = m]$ and the binomial 
expansion formula is given by $\sum_{k=0}^{m}\binom{m}{k}\theta_1^k\theta_2^{m-k} = (\theta_1 + \theta_2)^m$. Then using Equation 4.2-20 
finally yields 

$$E[K_1|K_1 + K_2 = m] = m \times \left(\frac{\theta_1}{\theta_1 + \theta_2}\right). \tag{4.2-21}$$
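
The closed form in Equation 4.2-21 can be confirmed by evaluating Equation 4.2-19 term by term. The sketch below is illustrative; the function names and the parameter values used in the usage note are arbitrary choices, not taken from the text.

```python
from math import exp, factorial

def poisson_pmf(k, theta):
    """Poisson PMF with parameter theta."""
    return exp(-theta) * theta ** k / factorial(k)

def cond_mean_k1(m, theta1, theta2):
    """E[K1 | K1 + K2 = m] computed directly from Equation 4.2-19."""
    # Numerators P[K1 = k1] P[K2 = m - k1] for k1 = 0, ..., m
    joint = [poisson_pmf(k1, theta1) * poisson_pmf(m - k1, theta2)
             for k1 in range(m + 1)]
    p_sum = sum(joint)  # P[K1 + K2 = m]
    return sum(k1 * p for k1, p in enumerate(joint)) / p_sum
```

For m = 7, θ₁ = 2, and θ₂ = 3, `cond_mean_k1(7, 2.0, 3.0)` should agree with 7 × 2/(2 + 3) = 2.8.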


Example 4.2-5 
(continuation of Example 4.2-4) Let $K_1, K_2, K_3$ denote multinomial RVs for l = 3, that is, 
a three-nomial (three outcomes possible). Then for n trials, we have the PMF† 

\begin{align}
P_K(k_1, k_2, k_3) &= P[K_1 = k_1, K_2 = k_2, K_3 = k_3] \notag \\
&= \begin{cases} \dfrac{n!}{k_1!\,k_2!\,k_3!}\,p_1^{k_1} p_2^{k_2} p_3^{k_3}, & k_1 + k_2 + k_3 = n, \text{ all } k_i \ge 0, \\ 0, & \text{else,} \end{cases} \tag{4.2-22}
\end{align}

where $p_1 + p_2 + p_3 = 1$. We wish to compute $E[K_1|K_1 + K_2 = m]$. 

†Note the notation different from that in the binomial case. Using this new multinomial notation for 
the binomial case, we would have, for a binomial RV K: $K_1 = K$ and $K_2 = n - K$. In the general l-nomial 
distribution we must always abide by the constraint that $K_1 + K_2 + \cdots + K_l = n$. 

Solution As in the previous example, we need to compute $P[K_1 = k_1|K_1 + K_2 = m]$. We 
write 

$$P[K_1 = k_1|K_1 + K_2 = m] = \frac{P[K_1 = k_1, K_1 + K_2 = m]}{P[K_1 + K_2 = m]}.$$

Note that for the multinomial, the event $\{\zeta: K_1(\zeta) + K_2(\zeta) = m\} \cap \{\zeta: K_1(\zeta) = k_1\}$ is 
identical to the event $\{\zeta: K_1(\zeta) = k_1, K_2(\zeta) = m - k_1, K_3(\zeta) = n - m\}$. Hence 

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{P[K_1 = k_1, K_2 = m - k_1, K_3 = n - m]}{P[K_3 = n - m]} \tag{4.2-23} \\
&= \frac{\dfrac{n!}{k_1!\,(m - k_1)!\,(n - m)!}\,p_1^{k_1} p_2^{m - k_1} p_3^{n - m}}{\dfrac{n!}{(n - m)!\,m!}\,(1 - p_3)^m p_3^{n - m}} \notag \\
&= \binom{m}{k_1} p_1^{k_1} p_2^{m - k_1} (p_1 + p_2)^{-m}. \tag{4.2-24}
\end{align}

Finally, using 

$$E[K_1|K_1 + K_2 = m] = \sum_{k_1} k_1 P[K_1 = k_1|K_1 + K_2 = m],$$

we obtain that 

$$E[K_1|K_1 + K_2 = m] = m\,\frac{p_1}{p_1 + p_2}. \tag{4.2-25}$$

We leave it to the reader to compute that 

$$E[K_2|K_1 + K_2 = m] = m\,\frac{p_2}{p_1 + p_2}. \tag{4.2-26}$$

These kinds of problems occur in the estimation procedure known as the 
expectation-maximization algorithm, discussed in detail in Chapter 11. 


Conditional Expectation as a Random Variable 


Consider, for the sake of being specific, a function Y = g(X) of a discrete RV X. Then its 
expected value is 

$$E[Y] = \sum_i g(x_i) P_X(x_i) = E[g(X)].$$

This suggests that we could write Equation 4.2-8 in similar notation, that is, 

$$E[Y] = \sum_i E[Y|X = x_i]\,P_X(x_i) = E[E[Y|X]]. \tag{4.2-27}$$

It is important to note that the object $E[Y|X = x_i]$ is a number, as is $g(x_i)$, but the 
object E[Y|X] is a function of the RV X and therefore is itself an RV. Given a probability 
space $\mathcal{P} = (\Omega, \mathcal{F}, P)$ and an RV X defined on $\mathcal{P}$, for each outcome $\zeta \in \Omega$ we generate 
the real number $E[Y|X = X(\zeta)]$. Thus, for ζ variable, E[Y|X] is an RV that assumes the 
value $E[Y|X = X(\zeta)]$ when ζ is the outcome of the underlying experiment. As always, the 
functional dependence of X on ζ is suppressed, and we specify X rather than the underlying 
probability space $\mathcal{P}$. The following example illustrates the use of the conditional expectation 
as an RV. 


Example 4.2-6 
(multi-channel communications) Consider a communication system in which the message 
delay (in milliseconds) is T and the channel choice is L. Let L = 1 for a satellite channel, 
L = 2 for a coaxial cable channel, L = 3 for a microwave surface link, and L = 4 for a 
fiber-optical link. A channel is chosen based on availability, which is a random phenomenon. 
Suppose $P_L(l) = 1/4$, l = 1, ..., 4. Assume that it is known that E[T|L = 1] = 500, 
E[T|L = 2] = 300, E[T|L = 3] = 200, and E[T|L = 4] = 100. Then the RV $g(L) \triangleq E[T|L]$ 
is defined by 

$$g(L) = \begin{cases} 500, & \text{for } L = 1, \quad P_L(1) = \tfrac{1}{4}, \\ 300, & \text{for } L = 2, \quad P_L(2) = \tfrac{1}{4}, \\ 200, & \text{for } L = 3, \quad P_L(3) = \tfrac{1}{4}, \\ 100, & \text{for } L = 4, \quad P_L(4) = \tfrac{1}{4}, \end{cases}$$

and $E[T] = E[g(L)] = 500 \times \tfrac{1}{4} + 300 \times \tfrac{1}{4} + 200 \times \tfrac{1}{4} + 100 \times \tfrac{1}{4} = 275$. 
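
The idea that E[T|L] is itself an RV, with its own PMF, can be made concrete in a few lines. The sketch below treats g(L) as a lookup table and averages it over the PMF of L; the helper name `total_expectation` is an assumption introduced for this illustration, and exact fractions are used so that the arithmetic matches the hand calculation.

```python
from fractions import Fraction

def total_expectation(cond_mean, pmf):
    """E[Y] = sum over x of E[Y | X = x] * P_X(x)  (Equation 4.2-8)."""
    return sum(cond_mean[x] * pmf[x] for x in pmf)

# Example 4.2-6: g(l) = E[T | L = l] with four equally likely channels.
g = {1: 500, 2: 300, 3: 200, 4: 100}
p_L = {l: Fraction(1, 4) for l in g}

# g(L) is itself an RV: collect the PMF of the values it can take.
pmf_g = {}
for l, p in p_L.items():
    pmf_g[g[l]] = pmf_g.get(g[l], Fraction(0)) + p

mean_T = total_expectation(g, p_L)
mean_from_pmf_g = sum(value * p for value, p in pmf_g.items())
```

Both `mean_T` and `mean_from_pmf_g` equal 275. The same helper reproduces the smoking-longevity figure of 76.75 years when called with `{0: 79.2, 1: 69.4}` and `{0: 0.75, 1: 0.25}`.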


The notion of E[Y|X] being an RV is equally valid for discrete, continuous, or mixed RVs X. 
For example, Equation 4.2-12 

$$E[Y] = \int_{-\infty}^{\infty} E[Y|X = x]\,f_X(x)\,dx$$

can also be written as E[Y] = E[E[Y|X]], where E[Y|X] in this case is a function of the 
continuous RV X. The inner expectation is with respect to Y and the outer with respect 
to X. 

The foregoing can be extended to more complex situations. For example, the object 
E[Z|X, Y] is a function of the RVs X and Y and therefore is itself an RV. For 
a particular outcome $\zeta \in \Omega$, it assumes the value $E[Z|X(\zeta), Y(\zeta)]$. To compute E[Z] we 
would write E[Z] = E[E[Z|X, Y]], which, for example, in the case of continuous RVs yields 

\begin{align}
E[Z] &= E[E[Z|X, Y]] \notag \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x, y)\,f_{XY}(x, y)\,dx\,dy\,dz. \tag{4.2-28}
\end{align}


We conclude this section by summarizing some properties of conditional expectations. 

Properties of Conditional Expectation: 

Property (i). E[Y] = E[E[Y|X]]. 

Proof See arguments leading up to Equation 4.2-8 for the discrete case and 
Equation 4.2-12 for the continuous case. The inner expectation is with respect to Y, the outer 
with respect to X. 

Property (ii). If X and Y are independent, then E[Y|X] = E[Y]. 

Proof 

$$E[Y|X = x] = \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy.$$

But $f_{XY}(x, y) = f_{Y|X}(y|x) f_X(x) = f_Y(y) f_X(x)$ if X and Y are independent. Hence 
$f_{Y|X}(y|x) = f_Y(y)$ and 

$$E[Y|X = x] = \int_{-\infty}^{\infty} y f_Y(y)\,dy = E[Y]$$

for each x. Thus, 

$$E[Y|X] = \int_{-\infty}^{\infty} y f_Y(y)\,dy = E[Y].$$

An analogous proof holds for the discrete case. 

Property (iii). E[Z|X] = E[E[Z|X, Y]|X]. 

Proof 

\begin{align*}
E[Z|X = x] &= \int_{-\infty}^{\infty} z f_{Z|X}(z|x)\,dz \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x, y)\,f_{Y|X}(y|x)\,dz\,dy \\
&= \int_{-\infty}^{\infty} dy\,f_{Y|X}(y|x)\int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x, y)\,dz \\
&= E[E[Z|X, Y]|X = x],
\end{align*}

where the inner expectation is with respect to Z and the outer with respect to Y. Since 
this is true for all x, we have E[Z|X] = E[E[Z|X, Y]|X]. 

The mean $\mu_Y = E[Y]$ is an estimate of the RV Y. The mean-square error in this estimate is 
$\varepsilon^2 = E[(Y - \mu_Y)^2]$. In fact this estimate is optimal in that any constant other than 
$\mu_Y$ would lead to an increased $\varepsilon^2$. ∎
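
The optimality of the mean as a constant estimate can be verified in a few lines. For any constant c (a symbol introduced here only for this argument), and with $\varepsilon^2 = E[(Y - \mu_Y)^2]$ denoting the mean-square error of the estimate $\mu_Y$, add and subtract $\mu_Y$ inside the square:

```latex
\begin{align*}
E[(Y - c)^2] &= E[(Y - \mu_Y + \mu_Y - c)^2] \\
&= E[(Y - \mu_Y)^2] + 2(\mu_Y - c)\,E[Y - \mu_Y] + (\mu_Y - c)^2 \\
&= \varepsilon^2 + (\mu_Y - c)^2 \;\ge\; \varepsilon^2 ,
\end{align*}
```

since $E[Y - \mu_Y] = 0$. Equality holds only when $c = \mu_Y$.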
4.3 MOMENTS OF RANDOM VARIABLES 


Although the expectation is an important “summary” number for the behavior of an RV, 
it is far from adequate in describing the complete behavior of the RV. Indeed, we saw in 
Section 4.1 that two sets of numbers could have the same sample mean but the sample 
deviations could be quite different. Likewise, for two RVs their expectations could be the 
same but their standard deviations could be very different. Summary numbers like $\mu_X$, $\sigma_X^2$, 
$E[X^2]$, and others are called moments. Generally, an RV will have many nonzero 
higher-order moments and, under certain conditions (Section 4.5), it is possible to completely 
describe the behavior of the RV, that is, reconstruct its pdf from knowledge of all the 
moments. In the following definitions we shall assume that the moments exist. However, 
this is not always the case. 


Definition 4.3-1 The rth moment of X is defined as 

$$m_r \triangleq E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\,dx, \qquad \text{where } r = 0, 1, 2, 3, \ldots. \tag{4.3-1}$$

If X is a discrete RV, the rth moment can be computed from the PMF as 

$$m_r = \sum_i x_i^r P_X(x_i).$$

We note that $m_0 = 1$, $m_1 = \mu$ (the mean). 

Definition 4.3-2 The rth central moment of X is defined as 

$$c_r \triangleq E[(X - \mu)^r], \qquad \text{where } r = 0, 1, 2, 3, \ldots. \tag{4.3-2a}$$

For a discrete RV we can compute $c_r$ from 

$$c_r = \sum_i (x_i - \mu)^r P_X(x_i). \tag{4.3-2b}$$

The most frequently used central moment is $c_2$. It is called the variance and is denoted by 
$\sigma^2$ and also sometimes by Var[X]. Note that $c_0 = 1$, $c_1 = 0$, $c_2 = \sigma^2$. An important formula 
that connects the variance to $E[X^2]$ and μ is obtained as follows: 

$$\sigma^2 = E[(X - \mu)^2] = E[X^2] - E[2\mu X] + E[\mu^2].$$

But for any constant a, E[aX] = aE[X] and $E[a^2] = a^2$. Thus 

\begin{align}
\sigma^2 &= E[X^2] - 2\mu E[X] + \mu^2 \notag \\
&= E[X^2] - \mu^2 \tag{4.3-3}
\end{align}

since $E[X] \triangleq \mu$. In order to save symbology, an overbar is often used to denote expectation. 
Thus $\overline{X^r} \triangleq E[X^r]$, and so forth, for other moments. Using this notation, Equation 4.3-3 
appears as 

$$\sigma^2 = \overline{X^2} - \mu^2 \tag{4.3-4a}$$

or, equivalently, 

$$\overline{X^2} = \sigma^2 + \mu^2. \tag{4.3-4b}$$

Equation 4.3-4a relates the second central moment $c_2$ to $m_2$ and μ. We can generalize this 
result as follows. Observe that 

$$(x - \mu)^r = \sum_{i=0}^{r}\binom{r}{i}(-1)^i \mu^i x^{r-i}. \tag{4.3-5a}$$

By taking the expectation of both sides of Equation 4.3-5a and recalling the linearity of the 
expectation operator, we obtain 

$$c_r = \sum_{i=0}^{r}\binom{r}{i}(-1)^i \mu^i m_{r-i}. \tag{4.3-5b}$$


Example 4.3-1 
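Let us compute $m_2$ for X, a binomial RV. By definition 

$$P_X(k) = \binom{n}{k} p^k q^{n-k}$$

and 

\begin{align}
m_2 &\triangleq \sum_{k=0}^{n} k^2 P_X(k) \notag \\
&= \sum_{k=0}^{n} k^2\,\frac{n!}{k!\,(n - k)!}\,p^k q^{n-k} \notag \\
&= p^2 n(n - 1) + np \notag \\
&= n^2p^2 + npq. \tag{4.3-6}
\end{align}

In going from line 2 to line 3 several steps of algebra were used whose duplication we leave 
as an exercise. In going from line 3 to line 4, we rearranged terms and used the fact that 
$q \triangleq 1 - p$. The expected value of X is 

\begin{align}
m_1 &= \sum_{k=0}^{n} k\,\frac{n!}{k!\,(n - k)!}\,p^k q^{n-k} \notag \\
&= np = \mu. \tag{4.3-7}
\end{align}

Using this result in Equation 4.3-6 and recalling Equation 4.3-4 allow us to conclude that 
for a binomial RV with PMF b(k; n, p) 

$$\sigma^2 = npq. \tag{4.3-8}$$

For any given n, maximum variance is obtained when p = q = 0.5 (Figure 4.3-1). 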
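
The binomial results, and the general relation between central and raw moments in Equation 4.3-5b, can be checked numerically. The sketch below is an illustration; the function names and the values n = 10, p = 0.3 used in the usage note are assumptions, not part of the example.

```python
from math import comb

def binomial_pmf(n, p):
    """List of P_X(k) = C(n, k) p^k q^(n-k) for k = 0, ..., n."""
    q = 1.0 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

def raw_moment(pmf, r):
    """m_r = sum over k of k^r P_X(k) for an RV taking values 0, 1, 2, ..."""
    return sum(k ** r * pk for k, pk in enumerate(pmf))

def central_moment_from_raw(pmf, r):
    """Equation 4.3-5b: c_r = sum_i C(r, i) (-1)^i mu^i m_(r-i)."""
    mu = raw_moment(pmf, 1)
    return sum(comb(r, i) * (-1) ** i * mu ** i * raw_moment(pmf, r - i)
               for i in range(r + 1))
```

With `pmf = binomial_pmf(10, 0.3)`, `raw_moment(pmf, 2)` should match $n^2p^2 + npq = 11.1$ and `central_moment_from_raw(pmf, 2)` should match npq = 2.1.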


Example 4.3-2 
(second moment of zero-mean Gaussian) Let us compute central moment $c_2$ for X : N(0, σ²). 
Since μ = 0, $c_2 = m_2$ and 

$$c_2 = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} x^2 e^{-\frac{1}{2}x^2/\sigma^2}\,dx.$$

But this integral was already evaluated in Example 4.1-7, where we found $E[X^2] = \sigma^2$. 
Thus, the variance of a Gaussian RV is indeed the parameter σ² regardless of whether X is 
zero-mean or not. 

[Figure 4.3-1: Variance of a binomial RV versus p.]


An interesting and somewhat more difficult example that illustrates a useful application 
of moments is given next. 


Example 4.3-3 
(entropy) The maximum entropy (ME) principle states that if we don’t know the pdf $f_X(x)$ 
of X but would like to estimate it with a function, say p(x), a good choice is the function 
p(x) which maximizes the entropy, defined by [4-5], 

$$H[X] = -\int_{-\infty}^{\infty} p(x)\ln p(x)\,dx \tag{4.3-9}$$

and which satisfies the constraints 

$$p(x) \ge 0 \tag{4.3-10a}$$

$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \tag{4.3-10b}$$

$$\int_{-\infty}^{\infty} x\,p(x)\,dx = \mu \tag{4.3-10c}$$

$$\int_{-\infty}^{\infty} x^2 p(x)\,dx = m_2, \text{ and so forth.} \tag{4.3-10d}$$

Suppose we know from measurements or otherwise only μ in Equation 4.3-10c and that 
x ≥ 0. Thus, we wish to find p(x) that maximizes H[X] of Equation 4.3-9 subject to 
the first three constraints of Equation 4.3-10. Using the method of Lagrange multipliers 
[4-6], the solution is obtained by maximizing the expression 

$$-\int_0^{\infty} p(x)\ln p(x)\,dx - \lambda_1\int_0^{\infty} p(x)\,dx - \lambda_2\int_0^{\infty} x\,p(x)\,dx$$

by differentiation with respect to p(x). The constants $\lambda_1$ and $\lambda_2$ are Lagrange multipliers 
and must be determined. After differentiating we obtain 

$$\ln p(x) = -(1 + \lambda_1) - \lambda_2 x$$

or 

$$p(x) = e^{-(1 + \lambda_1)}\,e^{-\lambda_2 x}. \tag{4.3-11}$$

When this result is substituted in Equations 4.3-10b and 4.3-10c, we find that 

$$e^{-(1 + \lambda_1)} = \frac{1}{\mu}$$

and 

$$\lambda_2 = \frac{1}{\mu}.$$

Hence our ME estimate of $f_X(x)$ is 

$$p(x) = \begin{cases} \dfrac{1}{\mu}\,e^{-x/\mu}, & x \ge 0, \\ 0, & x < 0. \end{cases} \tag{4.3-12}$$

The problem of obtaining the ME estimate of $f_X(x)$ when both μ and σ² are known is left 
as an exercise. In this case p(x) is the Normal pdf with mean μ and variance σ². 
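
To see the ME property of the exponential pdf numerically, one can compare its entropy with that of other nonnegative densities having the same mean. The sketch below does this for two competitors chosen here purely for illustration (a uniform pdf on [0, 2μ] and a gamma pdf with shape 2); the function names, the value μ = 2, the integration range, and the step count are all assumptions of this example.

```python
import math

MU = 2.0  # common mean of all three densities (arbitrary choice)

def entropy(pdf, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of p(x) ln p(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        p = pdf(lo + (k + 0.5) * h)
        if p > 0.0:                 # p ln p -> 0 as p -> 0
            total -= p * math.log(p) * h
    return total

def exponential_pdf(x):
    """ME solution of Equation 4.3-12 with mean MU."""
    return math.exp(-x / MU) / MU

def uniform_pdf(x):
    """Uniform on [0, 2 MU], which also has mean MU."""
    return 1.0 / (2.0 * MU) if 0.0 <= x <= 2.0 * MU else 0.0

def gamma2_pdf(x):
    """Gamma pdf with shape 2 and scale MU / 2, which also has mean MU."""
    theta = MU / 2.0
    return x * math.exp(-x / theta) / theta ** 2

h_exponential = entropy(exponential_pdf, 0.0, 80.0)
h_uniform = entropy(uniform_pdf, 0.0, 80.0)
h_gamma2 = entropy(gamma2_pdf, 0.0, 80.0)
```

The exponential entropy should come out close to 1 + ln μ and exceed both alternatives, consistent with Equation 4.3-12. A comparison against two particular competitors is of course only a spot check, not a proof.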


Tables of common means, variances, and mean-square values. Table 4.3-1 lists the 
means, variances, and mean-square values for common continuous RVs. Some of these 
have been calculated already in the text. Others are left as end-of-chapter problems. 
Table 4.3-2 is a similar table for common discrete RVs. 

Less useful than $m_r$ or $c_r$ are the absolute moments and generalized moments about some 
arbitrary point, say a, defined by, respectively, 

$$E[|X|^r] \triangleq \int_{-\infty}^{\infty} |x|^r f_X(x)\,dx \qquad \text{(absolute moment)}$$

$$E[(X - a)^r] \triangleq \int_{-\infty}^{\infty} (x - a)^r f_X(x)\,dx \qquad \text{(generalized moment)}.$$

Note that if we set a = μ, the generalized moments about a are then the central moments. 
If a = 0, the generalized moments are simply the moments $m_r$. 

Table 4.3-1 Means, Variances, and Mean-Square Values for Common Continuous RVs 

| Family | pdf f(x) | Mean μ = E[X] | Variance σ² | Mean square E[X²] |
|---|---|---|---|---|
| Uniform | U(a, b) | $\frac{1}{2}(a + b)$ | $\frac{1}{12}(b - a)^2$ | $\frac{1}{3}(b^2 + ab + a^2)$ |
| Exponential | $\frac{1}{\mu}e^{-x/\mu}u(x)$ | μ | $\mu^2$ | $2\mu^2$ |
| Gaussian | $\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ | μ | $\sigma^2$ | $\mu^2 + \sigma^2$ |
| Laplacian | $\frac{1}{\sqrt{2}\sigma}\exp\left(-\frac{\sqrt{2}\,\lvert x \rvert}{\sigma}\right)$ | 0 | $\sigma^2$ | $\sigma^2$ |
| Rayleigh | $\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)}u(x)$ | $\sqrt{\frac{\pi}{2}}\,\sigma$ | $\left(2 - \frac{\pi}{2}\right)\sigma^2$ | $2\sigma^2$ |

Table 4.3-2 Means, Variances, and Mean-Square Values for Common Discrete RVs 

| Family | PMF P(k) | Mean μ = E[K] | Variance σ² | Mean square E[K²] |
|---|---|---|---|---|
| Bernoulli | $P_B(k) = p$ for k = 1, $q \triangleq 1 - p$ for k = 0 | p | pq | p |
| Binomial | $b(k; n, p) = \binom{n}{k}p^k q^{n-k}$ | np | npq | $(np)^2 + npq$ |
| Geometric† | $\frac{1}{1 + \mu}\left(\frac{\mu}{1 + \mu}\right)^k u(k)$ | μ | $\mu + \mu^2$ | $\mu + 2\mu^2$ |
| Poisson | $\frac{a^k}{k!}e^{-a}u(k)$ | a | a | $a^2 + a$ |

Joint Moments 


Let us now turn to a topic first touched upon in Example 4.2-3. Suppose we are given 
two RVs X and Y and wish to have a measure of how good a linear prediction we can 
make of the value of, say, Y upon observing what value X has. At one extreme if X and 
Y are independent, observing X tells us nothing about Y. At the other extreme if, say, 
Y =aX +b, then observing the value of X immediately tells us the value of Y. However, in 
many situations in the real world, two RVs are neither completely independent nor linearly 
dependent. Given this state of affairs, it then becomes important to have a measure of 
how much can be said about one RV from observing another. The quantities called joint 
moments offer us such a measure. Not all joint moments, to be sure, are equally important 
in this task; especially important are certain second-order joint moments (to be defined 
shortly). However, as we shall see later, in various applications other joint moments are 
important as well and so we shall deal with the general case below. 


†The geometric PMF is sometimes written in terms of the parameter a as $(1 - a)a^k u(k)$ with 0 < a < 1. 
Then μ = a/(1 − a) with μ > 0. 

Definition 4.3-3 The ijth joint moment of X and Y is given by 

$$m_{ij} \triangleq E[X^i Y^j] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^i y^j f_{XY}(x, y)\,dx\,dy. \tag{4.3-13}$$

If X and Y are discrete, we can compute $m_{ij}$ from the PMF as 

$$m_{ij} = \sum_l \sum_m x_l^i\,y_m^j\,P_{X,Y}(x_l, y_m). \tag{4.3-14}$$

Definition 4.3-4 The ijth joint central moment of X and Y is defined by 

$$c_{ij} \triangleq E[(X - \overline{X})^i (Y - \overline{Y})^j], \tag{4.3-15}$$

where, in the notation introduced earlier, $\overline{X} \triangleq E[X]$, and so forth, for Y. The order of the 
moment is i + j. Thus, all of the following are second-order moments: 

\begin{align*}
m_{02} &= E[Y^2], & c_{02} &= E[(Y - \overline{Y})^2], \\
m_{20} &= E[X^2], & c_{20} &= E[(X - \overline{X})^2], \\
m_{11} &= E[XY], & c_{11} &= E[(X - \overline{X})(Y - \overline{Y})] \\
& & &= E[XY] - \overline{X}\,\overline{Y} \\
& & &\triangleq \text{Cov}[X, Y].
\end{align*}

As measures of predictability and in some cases statistical dependence, the most important 
joint moments are $m_{11}$ and $c_{11}$; they are known as the correlation and covariance of X and 
Y, respectively. The correlation coefficient† defined by 

$$\rho \triangleq \frac{c_{11}}{\sqrt{c_{02}\,c_{20}}} \tag{4.3-16}$$

was already introduced in Section 4.2 (Equation 4.2-16). It satisfies |ρ| ≤ 1. To show this 
consider the nonnegative expression 

$$E[(\lambda(X - \mu_X) - (Y - \mu_Y))^2] \ge 0,$$

where λ is any real constant. To verify that the left side is indeed nonnegative, we merely 
rewrite it in the form 

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [\lambda(x - \mu_X) - (y - \mu_Y)]^2 f_{XY}(x, y)\,dx\,dy \ge 0,$$

where the ≥ follows from the fact that the integral of a nonnegative quantity cannot be 
negative. 

†Note that it would be more properly termed the covariance coefficient or normalized covariance. 

The previous equation is a quadratic in λ. Indeed, after expanding we obtain 

$$Q(\lambda) \triangleq \lambda^2 c_{20} + c_{02} - 2\lambda c_{11} \ge 0.$$

Thus Q(λ) can have at most one real root. Hence its discriminant must satisfy 

$$\left(\frac{c_{11}}{c_{20}}\right)^2 - \frac{c_{02}}{c_{20}} \le 0$$

or 

$$c_{11}^2 \le c_{02}\,c_{20}, \tag{4.3-17}$$

whence the condition |ρ| ≤ 1 follows. 
When $c_{11}^2 = c_{02}\,c_{20}$, that is, |ρ| = 1, it is readily established that 

$$E\left[\left(\frac{c_{11}}{c_{20}}(X - \mu_X) - (Y - \mu_Y)\right)^2\right] = 0$$

or, equivalently, that 

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(\frac{c_{11}}{c_{20}}(x - \mu_X) - (y - \mu_Y)\right)^2 f_{XY}(x, y)\,dx\,dy = 0. \tag{4.3-18}$$

Since $f_{XY}(x, y)$ is never negative, Equation 4.3-18 implies that the term in parentheses is 
zero everywhere.† Thus, we have from Equation 4.3-18 that when |ρ| = 1 

$$Y = \frac{c_{11}}{c_{20}}(X - \mu_X) + \mu_Y, \tag{4.3-19}$$

that is, Y is a linear function of X. When Cov[X, Y] = 0, ρ = 0 and X and Y are said to 
be uncorrelated. 


†Except possibly over a bizarre set of points of zero probability. To be more precise, we should exchange 
the word “everywhere” in the text to “almost everywhere,” often abbreviated a.e. 


Properties of Uncorrelated Random Variables 


(a) If X and Y are uncorrelated, then 

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2, \tag{4.3-20}$$

where 

$$\sigma_{X+Y}^2 \triangleq E[(X + Y)^2] - (E[X + Y])^2.$$

(b) If X and Y are independent, they are uncorrelated. 

Proof of (a): We leave this as an exercise to the reader. Proof of (b): Since 
Cov[X, Y] = E[XY] − E[X]E[Y], we must show that E[XY] = E[X]E[Y]. But 

\begin{align*}
E[XY] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f_{XY}(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx \int_{-\infty}^{\infty} y f_Y(y)\,dy \qquad \text{(by independence assumption)} \\
&= E[X]E[Y]. \qquad ∎
\end{align*}


Example 4.3-4 
(linear prediction) Suppose we wish to predict the values of an RV Y by observing the 
values of another RV X. In particular, the available data (Figure 4.3-2) suggest that a good 
prediction model for Y is the linear function 

$$Y_p \triangleq \alpha X + \beta. \tag{4.3-21}$$

Now although Y may be related to X, the values it takes on may be influenced by other 
sources that do not affect X. Thus, in general, |ρ| ≠ 1 and we expect that there will be 
an error between the predicted value of Y, that is, $Y_p$, and the value that Y actually 
assumes. Our task becomes then to adjust the coefficients α and β in order to minimize the 
mean-square error 

$$\varepsilon^2 \triangleq E[(Y - Y_p)^2]. \tag{4.3-22}$$

This problem is a simple version of optimum linear prediction. In statistics it is called linear 
regression. 

[Figure 4.3-2: Pairwise observations on (X, Y) constitute a scatter diagram. The relationship between X and Y is approximated with a straight line.]

Solution Upon expanding Equation 4.3-22, we obtain 

$$\varepsilon^2 = E[Y^2] - 2\alpha E[XY] - 2\beta\mu_Y + 2\alpha\beta\mu_X + \alpha^2 E[X^2] + \beta^2.$$

To minimize ε² with respect to α and β, we solve for α and β that satisfy 

$$\frac{\partial \varepsilon^2}{\partial \alpha} = 0, \qquad \frac{\partial \varepsilon^2}{\partial \beta} = 0. \tag{4.3-23}$$

This yields the best α and β, which we denote by $\alpha_0$, $\beta_0$ in the sense that they minimize ε². 
A little algebra establishes that 

$$\alpha_0 = \frac{\text{Cov}[X, Y]}{\sigma_X^2} = \frac{\rho\,\sigma_Y}{\sigma_X} \tag{4.3-24a}$$

and 

\begin{align}
\beta_0 &= \mu_Y - \frac{\text{Cov}[X, Y]}{\sigma_X^2}\,\mu_X \notag \\
&= \mu_Y - \rho\,\frac{\sigma_Y}{\sigma_X}\,\mu_X. \tag{4.3-24b}
\end{align}

Thus, the best linear predictor is given by 

$$Y_p - \mu_Y = \rho\,\frac{\sigma_Y}{\sigma_X}(X - \mu_X) \tag{4.3-25}$$

and passes through the point $(\mu_X, \mu_Y)$. If we use $\alpha_0$, $\beta_0$ in Equation 4.3-22 we obtain the 
smallest mean-square error $\varepsilon_{\min}^2$, which is Problem 4.33, 

$$\varepsilon_{\min}^2 = \sigma_Y^2(1 - \rho^2). \tag{4.3-26}$$

Something rather strange happens when ρ = 0. From Equation 4.3-25 we see that for 
ρ = 0, $Y_p = \mu_Y$ regardless of X! This means that observing X has no bearing on our 
prediction of Y, and the best predictor is merely $Y_p = \mu_Y$. We encountered somewhat the 
same situation in Example 4.2-3. Thus, associating the correlation coefficient with ability to 
predict seems justified in problems involving linear prediction and the joint Gaussian pdf. 
In some fields, a lack of correlation between two RVs is taken to be prima facie evidence 
that they are unrelated, that is, independent. No doubt this conclusion arises in part from 
the fact that if two RVs, say, X and Y, are indeed independent, they will be uncorrelated. 
As stated earlier, the opposite is generally not true. An example follows. 
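
The formulas for the optimum coefficients and the minimum mean-square error can be exercised on a small discrete example. In the sketch below the joint PMF `joint` is a made-up illustration (not data from the text), and the function names are likewise assumptions; the code evaluates Equations 4.3-24a, 4.3-24b, and 4.3-26 and then checks them against a direct evaluation of Equation 4.3-22.

```python
def linear_predictor(joint):
    """Return (alpha0, beta0, min_mse) for a joint PMF {(x, y): probability}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exx = sum(x * x * p for (x, y), p in joint.items())
    eyy = sum(y * y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    var_x = exx - ex ** 2
    var_y = eyy - ey ** 2
    cov = exy - ex * ey
    alpha0 = cov / var_x                    # Equation 4.3-24a
    beta0 = ey - alpha0 * ex                # Equation 4.3-24b
    rho_sq = cov ** 2 / (var_x * var_y)
    return alpha0, beta0, var_y * (1.0 - rho_sq)   # Equation 4.3-26

def mse(joint, alpha, beta):
    """Mean-square error E[(Y - alpha X - beta)^2] of Equation 4.3-22."""
    return sum((y - alpha * x - beta) ** 2 * p for (x, y), p in joint.items())

# Hypothetical joint PMF used only to exercise the formulas.
joint = {(0, 1): 0.2, (1, 2): 0.3, (2, 2): 0.1, (2, 4): 0.4}
alpha0, beta0, min_mse = linear_predictor(joint)
```

Evaluating `mse(joint, alpha0, beta0)` should reproduce `min_mse`, and moving either coefficient away from its optimum value should increase the error.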


Example 4.3-5 
(uncorrelated is weaker than independence) Consider two RVs X and Y with joint PMF 
$P_{X,Y}(x_i, y_j)$ as shown. 

Values of $P_{X,Y}(x_i, y_j)$ 

| | $x_1 = -1$ | $x_2 = 0$ | $x_3 = +1$ |
|---|---|---|---|
| $y_1 = 0$ | 0 | 1/3 | 0 |
| $y_2 = 1$ | 1/3 | 0 | 1/3 |

X and Y are not independent, since $P_{XY}(0, 1) = 0 \ne P_X(0)P_Y(1) = \tfrac{2}{9}$. Furthermore, 
$\mu_X = 0$ so that $\text{Cov}(X, Y) = E[XY] - \mu_X\mu_Y = E[XY]$. We readily compute 

$$m_{11} = (-1)(1)\tfrac{1}{3} + (1)(1)\tfrac{1}{3} = 0.$$

Hence X and Y are uncorrelated but not independent. 
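
The two claims of this example, zero covariance and lack of independence, can be confirmed with exact arithmetic. The sketch below encodes the joint PMF of the example as a dictionary; the helper names are assumptions made for this illustration.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint PMF of the example: keys are (x, y); omitted pairs have probability 0.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expect(func):
    """E[func(X, Y)] under the joint PMF."""
    return sum(func(x, y) * p for (x, y), p in joint.items())

def marginal_x(x0):
    """P_X(x0), obtained by summing the joint PMF over y."""
    return sum(p for (x, y), p in joint.items() if x == x0)

def marginal_y(y0):
    """P_Y(y0), obtained by summing the joint PMF over x."""
    return sum(p for (x, y), p in joint.items() if y == y0)

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
product = marginal_x(0) * marginal_y(1)        # P_X(0) P_Y(1)
joint_value = joint.get((0, 1), Fraction(0))   # P_XY(0, 1)
```

Here `cov` is exactly zero, while `joint_value` (zero) differs from `product` (2/9), so the factorization required for independence fails.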


There is an important special case for which ρ = 0 always implies independence. We now 
discuss this case. 


Jointly Gaussian Random Variables 
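We say that two RVs are jointly Gaussian† (or jointly Normal) if their joint pdf is 

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}} \times \exp\left(\frac{-1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\,\frac{(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y} + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right]\right). \tag{4.3-27}$$

Five parameters are involved: $\sigma_X$, $\sigma_Y$, $\mu_X$, $\mu_Y$, and ρ. If ρ = 0 we observe that 

$$f_{XY}(x, y) = f_X(x) f_Y(y),$$

where 

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X}\exp\left(-\frac{1}{2}\left(\frac{x - \mu_X}{\sigma_X}\right)^2\right) \tag{4.3-28}$$

and 

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y}\exp\left(-\frac{1}{2}\left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right). \tag{4.3-29}$$

Thus, two jointly Gaussian RVs that are uncorrelated (i.e., ρ = 0) are also independent. The 
marginal densities $f_X(x)$ and $f_Y(y)$ for jointly normal RVs are always normal regardless of 
what ρ is. However, the converse does not hold; that is, if $f_X(x)$ and $f_Y(y)$ are Gaussian, 
one cannot conclude that X and Y are jointly Gaussian. 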

To see this we borrow from a popular x-ray imaging technique called computerized 
tomography (CT) useful for detecting cancer and other abnormalities in the body. Suppose 
we have an object with x-ray absorptivity function f(«,y) > 0. This function is like a joint 
pdf in that it is real, never negative, and easily normalized to a unit volume—however, this 
last feature is not important. Thus, we can establish a one-to-one relationship between a 
joint pdf fxy(a,y) and the x-ray absorptivity f(x,y). In CT, x-rays are passed through the 


+The jointly Normal pdf is sometimes called the two-dimensional Normal pdf in anticipation of the 
general multi-dimensional Normal pdf. The later becomes very cumbersome to write without using matrix 
notation (Chapter 5). 
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object along different lines, for some fixed angle, and the integrals of the absorptivity are 
measured and recorded. Each integral is called a projection and the set of all projections 
for given angle @ is called the profile function at @. Thus, the projection for a line at angle 
6 and displacement s from the center is given by [Figure. 4.3-3(a)] 


fe) ‘A eg fev 


where L(s,@) are the points along a line displaced from the center by s at angle 6 and dl 
is a differential length along L(s,6). If we let s vary from its smallest to largest value, we 
obtain the profile function for that angle. By collecting all the profiles for all the angles and 
using a sophisticated algorithm called filtered-convolution back-projection, it is possible to 
get a high-quality x-ray image of the body. Suppose we measure the profiles at 0 degrees 
and 90 degrees as shown in Figure 4.3-3(b). Then we obtain 


fi(z) = i- f(x, y)dy (horizontal profile) 


foly) = f(a, y) da (vertical profile). 


If f(x,y) is Gaussian, then we already know that f(a) and f2(y) will be Gaussian because 
fi; and fz are analogous to marginal pdfs. Now is it possible to modify f(x,y) from Gaussian 
to non-Gaussian without observing a change in the Gaussian profile? If yes, we have demon- 
strated our assertion that Gaussian marginals do not necessarily imply a joint Gaussian 
pdf. In Figure 4.3-3(c) we increase the absorptivity of the object by an amount P along the
45-degree strip running from a to b and decrease the absorptivity by the same amount P
along the 135-degree strip running from a′ to b′. Then since the profile integrals add and
subtract P in both horizontal and vertical directions, the net change in $f_1(x)$ and $f_2(y)$ is
zero. This proves our assertion. We assume that P is not so large that when subtracted
from $f(x,y)$ along a′-b′, the result is negative. The reason we must make this assumption
is that pdf's and x-ray absorptivities can never be negative.

Figure 4.3-3 Using the computerized tomography paradigm to show that Gaussian marginal pdf's do
not imply a joint Gaussian distribution. (a) A projection is the line integral at displacement s and angle
θ. The set of all projections for a given angle is the profile function for that angle. (b) A joint Gaussian
x-ray object produces Gaussian-shaped profile functions in the horizontal and vertical directions; (c) by
adding a constant absorptivity along a-b and subtracting an absorptivity along a′-b′, the profile functions
remain the same but the underlying absorptivity is not Gaussian anymore.
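The strip argument can be checked on a discrete grid. The following is a minimal sketch, not part of the original text: the grid size, the extent of the strips, and the helper names `gaussian_grid`, `add_strips`, and `profiles` are all illustrative assumptions. Row and column sums stand in for the two profile integrals.

```python
import math


def gaussian_grid(n, sigma=1.0, half_width=2.0):
    """Sample a circular Gaussian absorptivity f(x, y) on an n-by-n grid."""
    xs = [-half_width + 2.0 * half_width * i / (n - 1) for i in range(n)]
    return [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) for x in xs] for y in xs]


def add_strips(f, margin):
    """Add P on a central piece of one diagonal and subtract P on the other.

    P is chosen as half the smallest value on the strip being reduced, so the
    modified absorptivity stays nonnegative.
    """
    n = len(f)
    idx = range(margin, n - margin)
    p = 0.5 * min(f[i][n - 1 - i] for i in idx)
    g = [row[:] for row in f]
    for i in idx:
        g[i][i] += p          # strip a-b
        g[i][n - 1 - i] -= p  # strip a'-b'
    return g


def profiles(f):
    """Row sums and column sums: discrete stand-ins for the two profiles."""
    return [sum(row) for row in f], [sum(col) for col in zip(*f)]


f = gaussian_grid(21)
g = add_strips(f, margin=5)
```

Here `profiles(f)` and `profiles(g)` agree to rounding error even though `g` is no longer a sampled Gaussian, which is the discrete counterpart of the claim that Gaussian marginals do not force a jointly Gaussian pdf.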

To illustrate a joint normal distribution consider the following somewhat idealized situ- 
ation. Let X and Y denote the height of the husband and wife, respectively, of a married pair 
picked at random from the population of married people. It is often assumed that X and 
Y are individually Gaussian although this is obviously only an approximation since heights 
are bounded from below by zero and from above by physiological constraints. Conventional 
wisdom has it that in our society tall people prefer tall mates and short people prefer short 
mates. If this is indeed true, then X and Y are positively correlated, that is, ρ > 0. On the
other hand, in certain societies it may be fashionable for tall men to marry short women
and for tall women to marry short men. Again we can expect X and Y to be correlated
albeit negatively this time, that is, ρ < 0. Finally, if all marriages are the result of a lottery,
we would expect ρ to be zero or very small.†


*Contours of constant density of the joint Gaussian pdf. It is of interest to determine
the locus of points in the xy plane when $f_{XY}(x,y)$ is set constant. Clearly $f_{XY}(x,y)$ will
be constant if the exponent is set to a constant, say, $c^2$:

$$\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\,\frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 = c^2.$$

This is the equation of an ellipse centered at $x = \mu_X$, $y = \mu_Y$. For simplicity we set
$\mu_X = \mu_Y = 0$. When ρ = 0, the major and minor diameters of the ellipse are parallel to the
x- and y-axes, a condition we know to associate with independence of X and Y. If ρ = 0
and $\sigma_X = \sigma_Y$, the ellipse degenerates into a circle. Several cases are shown in Figure 4.3-4.

Surprisingly the marginal densities $f_X(x)$ and $f_Y(y)$ computed from the joint pdf of
Equation 4.3-27 do not depend on the parameter ρ. To see this we compute

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dy$$

with $\mu_X = \mu_Y = 0$ for simplicity. The integration, while somewhat messy, is easily done by
following these three steps:


1. Factor out of the integral all terms that do not depend on y; 

2. Complete the squares in the exponent of e (see “completing the square” in 
Appendix A); and 

3. Recall that for b > 0 and real y

$$\frac{1}{\sqrt{2\pi}\,b}\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2}\left(\frac{y-x}{b}\right)^2\right]dx = 1.$$


†In statistics it is quite difficult to observe zero correlation between two random variables, even when in
theory they would be expected to be uncorrelated. The phenomenon of small, random correlations is used
by hucksters and others to prove a point, which in reality is not valid.

*Starred material can be omitted on a first reading.

Figure 4.3-4 Contours of constant density for the joint normal ($\mu_X = \mu_Y = 0$): (a) $\sigma_X = \sigma_Y$, ρ = 0;
(b) $\sigma_X > \sigma_Y$, ρ = 0; (c) $\sigma_X < \sigma_Y$, ρ = 0; (d) $\sigma_X = \sigma_Y$, ρ > 0.


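Indeed after step 2 we obtain

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X}\exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_X}\right)^2\right]\left\{\frac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}}\int_{-\infty}^{\infty}\exp\left[-\frac{(y-\rho x\,\sigma_Y/\sigma_X)^2}{2\sigma_Y^2(1-\rho^2)}\right]dy\right\}. \tag{4.3-30}$$

But the term in curly brackets is unity. Hence

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X}\exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_X}\right)^2\right]. \tag{4.3-31}$$

A similar calculation for $f_Y(y)$ would furnish

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y}\exp\left[-\frac{1}{2}\left(\frac{y}{\sigma_Y}\right)^2\right]. \tag{4.3-32}$$

As we stated earlier, if ρ = 0, then X and Y are independent. On the other hand as ρ → ±1,
X and Y become linearly dependent. For simplicity let $\sigma_X = \sigma_Y \triangleq \sigma$ and $\mu_X = \mu_Y = 0$;
then the contour of constant density becomes

$$x^2 - 2\rho xy + y^2 = c^2\sigma^2,$$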


which is a 45-degree tilted ellipse (with respect to the x-axis) for ρ > 0 and a 135-degree
tilted ellipse for ρ < 0. We can generate a coordinate system that is rotated 45 degrees
from the x-y system by introducing the coordinate transformation

$$v = \frac{x+y}{\sqrt{2}}, \qquad w = \frac{x-y}{\sqrt{2}}.$$

Then the contour of constant density becomes

$$v^2[1-\rho] + w^2[1+\rho] = c^2\sigma^2,$$

which is an ellipse with major and minor diameters parallel to the v- and w-axes. If ρ > 0,
the major diameter is parallel to the v-axis; if ρ < 0, the major diameter is parallel to the
w-axis. As ρ → ±1, the lengths of the major diameters become infinitely long and all of the
pdf concentrates along the line y = x (ρ → 1) or y = −x (ρ → −1).
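The 45-degree rotation can be verified numerically. This is a small sketch under stated assumptions: the function names are illustrative, and the right-hand-side constant of the contour equation is passed in as a generic positive number `k` rather than tied to any particular normalization.

```python
import math


def contour_xy(x, y, rho):
    """Left side of the contour equation in the original (x, y) coordinates."""
    return x * x - 2.0 * rho * x * y + y * y


def rotate45(x, y):
    """Coordinate change v = (x + y)/sqrt(2), w = (x - y)/sqrt(2)."""
    return (x + y) / math.sqrt(2.0), (x - y) / math.sqrt(2.0)


def contour_vw(v, w, rho):
    """Left side of the contour equation in the rotated (v, w) coordinates."""
    return v * v * (1.0 - rho) + w * w * (1.0 + rho)


def semi_axes(rho, k=1.0):
    """Semi-axis lengths along v and w of v^2(1-rho) + w^2(1+rho) = k."""
    return math.sqrt(k / (1.0 - rho)), math.sqrt(k / (1.0 + rho))
```

For example, `semi_axes(0.9)` gives a much longer axis along v than along w, and the v-axis length grows without bound as ρ approaches 1, in line with the concentration of the pdf along y = x.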
Finally by introducing two new RVs

$$V \triangleq (X+Y)/\sqrt{2}$$
$$W \triangleq (X-Y)/\sqrt{2},$$

we find that as ρ → 1

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right]\times\delta(y-x)$$

or, equivalently,

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{y}{\sigma}\right)^2\right]\times\delta(y-x).$$

This degeneration of the joint Gaussian into a pdf of only one variable along the line y = x
is due to the fact that as ρ → 1, X and Y become equal. We leave the details as an exercise
to the student.
A computer rendition of the joint Gaussian pdf and its contours of constant density is
shown in Figure 4.3-5 for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 2$, and ρ = 0.9.


4.4 CHEBYSHEV AND SCHWARZ INEQUALITIES 


The Chebyshev† inequality furnishes a bound on the probability of how much an RV X can
deviate from its mean value $\mu_X$.


Theorem 4.4-1 (Chebyshev inequality) Let X be an arbitrary RV with mean $\mu_X$
and finite variance $\sigma^2$. Then for any δ > 0

$$P[|X-\mu_X| \ge \delta] \le \frac{\sigma^2}{\delta^2}. \tag{4.4-1}$$


†Pafnuty L. Chebyshev (1821-1894), Russian mathematician.

Figure 4.3-5 (a) Gaussian pdf with $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 2$, and ρ = 0.9; (b) contours of constant
density.

Proof Equation 4.4-1 follows directly from the following observation:

$$\begin{aligned}
\sigma^2 &= \int_{-\infty}^{\infty}(x-\mu_X)^2 f_X(x)\,dx \ge \int_{|x-\mu_X|\ge\delta}(x-\mu_X)^2 f_X(x)\,dx \\
&\ge \delta^2\int_{|x-\mu_X|\ge\delta} f_X(x)\,dx \\
&= \delta^2 P[|X-\mu_X|\ge\delta].
\end{aligned}$$

Since

$$\{|X-\mu_X|\ge\delta\}\cup\{|X-\mu_X|<\delta\} = \Omega \quad (\Omega \text{ being the certain event}),$$

and the two events being unioned are disjoint, it follows that

$$P[|X-\mu_X|<\delta] \ge 1-\frac{\sigma^2}{\delta^2}. \tag{4.4-2}$$

Sometimes it is convenient to express δ in terms of σ, that is, $\delta \triangleq k\sigma$, where k is a constant.†
Then Equations 4.4-1 and 4.4-2 become, respectively,

$$P[|X-\mu_X|\ge k\sigma] \le \frac{1}{k^2} \tag{4.4-3}$$

$$P[|X-\mu_X|< k\sigma] \ge 1-\frac{1}{k^2}. \tag{4.4-4}$$
Example 4.4-1 


(deviation from the mean for a Normal RV) Let $X : N(\mu_X, \sigma^2)$. How do $P[|X-\mu_X| < k\sigma]$
and $P[|X-\mu_X| \ge k\sigma]$ compare with the Chebyshev bound (CB)?


Solution Using Equations 2.4-14d and 2.4-14e, it is easy to show that $P[|X-\mu_X| < k\sigma] = 2\,\mathrm{erf}(k)$
and $P[|X-\mu_X| \ge k\sigma] = 1 - 2\,\mathrm{erf}(k)$, where erf(k) is defined in Equation 2.4-12.
Using Table 2.4-1 and Equations 4.4-3 and 4.4-4, we obtain Table 4.4-1.

From Table 4.4-1 we see that the Chebyshev bound is not very good; however, it must
be recalled that the bound applies to any RV X as long as $\sigma^2$ exists.
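The entries of Table 4.4-1 can be recomputed with a few lines of code. The sketch below is an illustration rather than the authors' procedure: it uses the standard-library `math.erf`, whose definition differs from the erf of Equation 2.4-12, so the Normal probability is written here as `math.erf(k / sqrt(2))`. Bounds are clipped to the interval [0, 1], as in the table.

```python
import math


def normal_within(k):
    """P[|X - mu| < k*sigma] for a Normal RV, using the standard-library erf."""
    return math.erf(k / math.sqrt(2.0))


def chebyshev_lower(k):
    """Chebyshev lower bound on P[|X - mu| < k*sigma] (Equation 4.4-4), clipped at 0."""
    return max(0.0, 1.0 - 1.0 / k ** 2) if k > 0 else 0.0


def chebyshev_upper(k):
    """Chebyshev upper bound on P[|X - mu| >= k*sigma] (Equation 4.4-3), clipped at 1."""
    return min(1.0, 1.0 / k ** 2) if k > 0 else 1.0


for k in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    p_in = normal_within(k)
    print(f"{k:3.1f}  {p_in:.3f}  {chebyshev_lower(k):.3f}  "
          f"{1.0 - p_in:.3f}  {chebyshev_upper(k):.3f}")
```

The printed rows agree with the table to within rounding in the last digit.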

There are a number of extensions of the Chebyshev inequality.‡ We consider such an
extension in what follows.


Markov Inequality 


Consider an RV X for which $f_X(x) = 0$ for x < 0. Then X is called a nonnegative RV and
the Markov inequality applies:

$$P[X \ge \delta] \le \frac{E[X]}{\delta}. \tag{4.4-5}$$

In contrast to the Chebyshev bound, which involves both the mean and variance, this bound
involves only the mean of X.


†The Chebyshev inequality is not very useful when k or δ is small.
‡See Davenport [4-2, p. 256].

Table 4.4-1

k      P[|X − μ_X| < kσ]   CB      P[|X − μ_X| ≥ kσ]   CB
0      0                   0       1                   1
0.5    0.383               0       0.617               1
1.0    0.683               0       0.317               1
1.5    0.866               0.556   0.134               0.444
2.0    0.955               0.750   0.045               0.250
2.5    0.988               0.840   0.012               0.160
3.0    0.997               0.889   0.003               0.111


Proof of Equation 4.4-5 


$$E[X] = \int_0^{\infty} x f_X(x)\,dx \ge \int_{\delta}^{\infty} x f_X(x)\,dx \ge \delta\int_{\delta}^{\infty} f_X(x)\,dx \ge \delta P[X \ge \delta]$$

whence Equation 4.4-5 follows. Equation 4.4-5 puts a bound on what fraction of a population
can exceed δ.


Example 4.4-2 
(bad resistors) Assume that in the manufacturing of very low grade electrical 1000-ohm
resistors the average resistance, as determined by a statistical analysis of measurements,
is indeed 1000 ohms but there is a large variation about this value. If all resistors over
1500 ohms are to be discarded, what is the maximum fraction of resistors to meet such a
fate?


Solution With $\mu_X = 1000$ and $\delta = 1500$, we obtain

$$P[X \ge 1500] \le \frac{1000}{1500} \approx 0.67.$$

Thus, if nothing else, the manufacturer has the assurance that the percentage of discarded
resistors cannot exceed 67 percent of the total.
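The Markov bound cannot be improved without more information than the mean. As a purely hypothetical illustration (not a realistic resistor population), the sketch below builds a nonnegative two-point distribution with mean 1000 for which the bound holds with equality; the names `worst_case`, `mean_of`, and `tail` are assumptions made for the example.

```python
from fractions import Fraction


def markov_bound(mean, delta):
    """Right side of Equation 4.4-5: E[X] / delta."""
    return Fraction(mean) / Fraction(delta)


# Hypothetical worst case: 0 ohms with probability 1/3, 1500 ohms with probability 2/3.
worst_case = {0: Fraction(1, 3), 1500: Fraction(2, 3)}


def mean_of(pmf):
    """Expected value of a discrete RV given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def tail(pmf, delta):
    """P[X >= delta] for a discrete RV given as {value: probability}."""
    return sum(p for x, p in pmf.items() if x >= delta)
```

With this distribution `mean_of(worst_case)` is 1000 and `tail(worst_case, 1500)` equals `markov_bound(1000, 1500)`, that is, 2/3.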


The Schwarz Inequality 


We have already encountered the probabilistic form of the Schwarz† inequality in Equation
4.3-17 repeated here as

$$\mathrm{Cov}^2[X,Y] \le E[(X-\mu_X)^2]\,E[(Y-\mu_Y)^2]$$

with equality if and only if Y is a linear function of X. Upon taking the square root of
both sides of this inequality, we have that the magnitude of covariance between two RVs is
always upper bounded by the square root of the product of the two variances

$$|\mathrm{Cov}[X,Y]| \le (\sigma_X^2\sigma_Y^2)^{1/2}.$$


†H. Amandus Schwarz (1843-1921), German mathematician.

In later work we shall need another version of the Schwarz inequality that is commonly
used in obtaining results in signal processing and stochastic processes. Consider two
nonrandom (deterministic) functions h and g not necessarily real valued. Define the norm
of an ordinary function f by

$$\|f\| \triangleq \left(\int_{-\infty}^{\infty}|f(x)|^2\,dx\right)^{1/2} \tag{4.4-6}$$

whenever the integral exists. Also define the scalar or inner product of h with g, denoted
by (h, g), as

$$(h,g) \triangleq \int_{-\infty}^{\infty} h(x)g^*(x)\,dx = (g,h)^*. \tag{4.4-7}$$
The deterministic form of the Schwarz inequality is then 


$$|(h,g)| \le \|h\|\,\|g\| \tag{4.4-8}$$

with equality if and only if h is proportional to g, that is, h(x) = ag(x) for some constant a.
The proof of Equation 4.4-8 is obtained by considering the norm of λh(x) + g(x) as a function
of the variable λ,

$$\|\lambda h(x)+g(x)\|^2 = |\lambda|^2\|h\|^2 + \lambda(h,g) + \lambda^*(h,g)^* + \|g\|^2 \ge 0. \tag{4.4-9}$$

If we let

$$\lambda = -\frac{(h,g)^*}{\|h\|^2} \tag{4.4-10}$$

the right side of Equation 4.4-9 reduces to $\|g\|^2 - |(h,g)|^2/\|h\|^2 \ge 0$, and

Equation 4.4-8 follows. In the special case where h and g are real functions of real RVs, that
is, h(X), g(X), Equation 4.4-8 still is valid provided that the definitions of norm and inner
product are modified as follows:

$$\|h\|^2 \triangleq \int_{-\infty}^{\infty} h^2(x) f_X(x)\,dx = E[h^2(X)] \tag{4.4-11}$$

$$(h,g) \triangleq \int_{-\infty}^{\infty} h(x)g(x) f_X(x)\,dx = E[h(X)g(X)] \tag{4.4-12}$$

whence we obtain

$$|E[h(X)g(X)]| \le (E[h^2(X)])^{1/2}(E[g^2(X)])^{1/2}. \tag{4.4-13}$$
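Equation 4.4-13 is easy to spot-check for a discrete RV. The sketch below is only an illustration: the PMF and the functions `h` and `g` are arbitrary choices made for the example, and `schwarz_sides` is a hypothetical helper name.

```python
import math

# Hypothetical PMF of a discrete RV X, given as {value: probability}.
pmf = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}


def expect(fn, pmf):
    """E[fn(X)] for a discrete RV."""
    return sum(fn(x) * p for x, p in pmf.items())


def schwarz_sides(h, g, pmf):
    """Return (|E[h g]|, sqrt(E[h^2] E[g^2])), the two sides of Equation 4.4-13."""
    lhs = abs(expect(lambda x: h(x) * g(x), pmf))
    rhs = math.sqrt(expect(lambda x: h(x) ** 2, pmf) * expect(lambda x: g(x) ** 2, pmf))
    return lhs, rhs


def h(x):
    return x * x + 1.0


def g(x):
    return math.sin(x)
```

Calling `schwarz_sides(h, g, pmf)` returns a left side no larger than the right side; replacing `g` by a multiple of `h` makes the two sides equal, which matches the equality condition stated for Equation 4.4-8.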


Law of large numbers. A very important application of Chebyshev's inequality is to
prove the so-called weak Law of Large Numbers (LLN) that gives conditions under which a
sample mean converges to the ensemble mean.

Example 4.4-3

(weak law of large numbers) Let $X_1,\ldots,X_n$ be i.i.d. RVs with mean $\mu_X$ and variance $\sigma_X^2$.
Assume that we don't know the value of $\mu_X$ (or $\sigma_X$) and thus consider the sample mean
estimator†

$$\hat{\mu}_n \triangleq \frac{1}{n}\sum_{i=1}^{n} X_i$$

as an estimator for $\mu_X$. We can use the Chebyshev inequality to show that $\hat{\mu}_n$ is asymptotically
a perfect estimator of $\mu_X$. First we compute

$$E[\hat{\mu}_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu_X.$$

Next we compute

$$\mathrm{Var}[\hat{\mu}_n] = \frac{1}{n^2}\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2}\,n\sigma_X^2 = \frac{1}{n}\sigma_X^2.$$

Thus, by the Chebyshev inequality (Equation 4.4-1) we have

$$P[|\hat{\mu}_n - \mu_X| \ge \delta] \le \sigma_X^2/n\delta^2.$$

Clearly for any fixed δ > 0, the right side can be made arbitrarily small by choosing n large
enough. Thus,

$$\lim_{n\to\infty} P[|\hat{\mu}_n - \mu_X| \ge \delta] = 0$$

for every δ > 0. Note though that for δ small, we may need n quite large to guarantee that
the probability of the event $\{|\hat{\mu}_n - \mu_X| \ge \delta\}$ is sufficiently small. This type of convergence
is called convergence in probability and is treated more extensively in Chapter 8.

The LLN is the theoretical basis for estimating $\mu_X$ from measurements. When an experimenter
takes the sample mean of n measurements, he is relying on the LLN in order to
use the sample mean as an estimate of the unknown mathematical expectation (ensemble
average) $E[X] = \mu_X$.
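A short simulation makes the weak LLN and its Chebyshev bound concrete. The following is a sketch under assumptions not in the text: the observations are taken to be uniform on (0, 1), so that the mean is 1/2 and the variance is 1/12, and the function names, trial count, and seed are illustrative.

```python
import random


def sample_mean(rng, n):
    """Sample mean of n i.i.d. uniform(0, 1) observations."""
    return sum(rng.random() for _ in range(n)) / n


def deviation_frequency(n, delta, trials=1000, seed=0):
    """Empirical frequency of the event |sample mean - 1/2| >= delta."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if abs(sample_mean(rng, n) - 0.5) >= delta)
    return hits / trials


def chebyshev_bound(n, delta, var=1.0 / 12.0):
    """Chebyshev bound var / (n * delta^2) on the same probability, clipped at 1."""
    return min(1.0, var / (n * delta ** 2))
```

Comparing `deviation_frequency(n, 0.05)` with `chebyshev_bound(n, 0.05)` for increasing n shows both quantities shrinking, with the observed frequency typically far below the bound, which reflects how conservative the Chebyshev inequality is.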


†An estimator is a function of the observations $X_1, X_2, \ldots, X_n$ that estimates a parameter of the
distribution. Estimators are random variables. When an estimator takes on a particular value, that is, a
realization, that number is sometimes called the estimate. Estimators are discussed in Chapter 6.

Sometimes inequalities can be derived from properties of the pdf. We illustrate with
the following example due to Yongyi Yang.


Example 4.4-4
(symmetric RVs) Let the pdf of the real RV X satisfy $f_X(x) = f_X(-x)$; that is, X is symmetrically
distributed around zero. Show that $\sigma_X \ge E[|X|]$ with equality if Var(|X|) = 0.


Solution Let $Y \triangleq |X|$. Then $E[Y^2] = E[X^2] = \mu_X^2 + \sigma_X^2 = \sigma_X^2$ since $\mu_X = 0$. Also
$E[Y^2] = \mu_Y^2 + \sigma_Y^2 = E^2[|X|] + \sigma_Y^2 = \sigma_X^2$. But $\sigma_Y^2 \ge 0$. Hence $E^2[|X|] \le \sigma_X^2$ with equality if
$\sigma_Y^2 = 0$. Such a case arises when the pdf of X has the form $f_X(x) = \frac{1}{2}[\delta(x-a) + \delta(x+a)]$,
where a is some positive number. Then Y = a, $\sigma_Y = 0$, and $E[|X|] = \sigma_X$.


Another inequality is furnished by the Chernoff bound. We discuss this bound in Section 
4.6 after introducing the moment-generating function M(t) in the next section. 


4.5 MOMENT-GENERATING FUNCTIONS 


The moment-generating function (MGF), if it exists, of an RV X is defined by†

$$M(t) \triangleq E[e^{tX}] \tag{4.5-1}$$
$$\phantom{M(t)} = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx, \tag{4.5-2}$$

where t is a complex variable.
For discrete RVs, we can define M(t) using the PMF as

$$M(t) = \sum_i e^{tx_i} P_X(x_i). \tag{4.5-3}$$

From Equation 4.5-2 we see that except for a sign reversal in the exponent, the MGF is the
two-sided Laplace transform of the pdf for which there is a known inversion formula. Thus,
in general, knowing M(t) is equivalent to knowing $f_X(x)$ and vice versa.

The main reasons for introducing M(t) are (1) it enables a convenient computation of 
the moments of X; (2) it can be used to estimate fx(x) from experimental measurements 
of the moments; (3) it can be used to solve problems involving the computation of the sums 
of RVs; and (4) it is an important analytical instrument that can be used to demonstrate 
basic results such as the Central Limit Theorem.‡


†The terminology varies (see Feller [4-1], p. 411).
‡To be discussed in Section 4.7.

Proceeding formally, if we expand $e^{tX}$ and take expectations, then

$$\begin{aligned}
E[e^{tX}] &= E\left[1 + tX + \frac{(tX)^2}{2!} + \cdots + \frac{(tX)^n}{n!} + \cdots\right] \\
&= 1 + t m_1 + \frac{t^2}{2!}m_2 + \cdots + \frac{t^n}{n!}m_n + \cdots.
\end{aligned} \tag{4.5-4}$$

Since the moments $m_i$ may not exist (for example, the Cauchy pdf has no finite moments of
order two or higher, and even its mean is defined only as a principal value), M(t) may not exist.
However, if M(t) does exist, any moment is easily obtained by differentiation. Indeed, if we
allow the notation

$$M^{(k)}(0) \triangleq \left.\frac{d^k}{dt^k}M(t)\right|_{t=0}$$

then

$$m_k = M^{(k)}(0), \quad k = 0, 1, \ldots. \tag{4.5-5}$$


Example 4.5-1 
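(Gaussian MGF) Let $X : N(\mu, \sigma^2)$. Its MGF is then given as

$$M_X(t) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)e^{tx}\,dx. \tag{4.5-6}$$

Using the procedure known as "completing the square"† in the exponent, we can write
Equation 4.5-6 as

$$M_X(t) = \exp[\mu t + \sigma^2t^2/2] \times \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2\sigma^2}\left[x-(\mu+\sigma^2t)\right]^2\right)dx.$$

But the second factor is unity since it is the integral of a Gaussian pdf with
mean $\mu + \sigma^2t$ and variance $\sigma^2$. Hence the Gaussian MGF is

$$M_X(t) = \exp(\mu t + \sigma^2t^2/2), \tag{4.5-7}$$

from which we obtain

$$M_X^{(1)}(0) = \mu, \qquad M_X^{(2)}(0) = \mu^2 + \sigma^2.$$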


†See "Completing the square" in Appendix A.

Example 4.5-2
(MGF of binomial) Let B be a binomial RV with parameters n (number of tries), p (probability
of a success per trial), and q = 1 − p. Then the MGF is given as

$$\begin{aligned}
M_B(t) &= \sum_{k=0}^{n} e^{tk}\binom{n}{k}p^kq^{n-k} \\
&= \sum_{k=0}^{n}\binom{n}{k}[e^tp]^kq^{n-k} \\
&= (pe^t+q)^n.
\end{aligned} \tag{4.5-8}$$

We obtain

$$M_B^{(1)}(0) = np = \mu$$
$$M_B^{(2)}(0) = \left\{npe^t(pe^t+q)^{n-1} + n(n-1)p^2e^{2t}(pe^t+q)^{n-2}\right\}_{t=0} = npq + \mu^2. \tag{4.5-9}$$

Hence

$$\mathrm{Var}[B] = npq. \tag{4.5-10}$$
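Equation 4.5-5 can be exercised numerically by differentiating the binomial MGF of Equation 4.5-8 with central finite differences. This is a rough sketch: the step size `h` and the helper names are assumptions, and finite differences only approximate the derivatives.

```python
import math


def mgf_binomial(t, n, p):
    """Binomial MGF (p e^t + q)^n from Equation 4.5-8."""
    return (p * math.exp(t) + 1.0 - p) ** n


def derivative_at_zero(fn, order, h=1e-3):
    """Central finite-difference estimate of the first or second derivative at 0."""
    if order == 1:
        return (fn(h) - fn(-h)) / (2.0 * h)
    if order == 2:
        return (fn(h) - 2.0 * fn(0.0) + fn(-h)) / h ** 2
    raise ValueError("only orders 1 and 2 are implemented in this sketch")


n, p = 10, 0.3
m1 = derivative_at_zero(lambda t: mgf_binomial(t, n, p), 1)
m2 = derivative_at_zero(lambda t: mgf_binomial(t, n, p), 2)
variance = m2 - m1 ** 2
```

For n = 10 and p = 0.3 the estimates come out close to np = 3 and npq = 2.1, matching Equations 4.5-9 and 4.5-10.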


Example 4.5-3 
(MGF of geometric distribution) Let X follow the geometric distribution. Then the PMF
is $P_X(n) = a^n(1-a)u(n)$, n = 0, 1, 2, ... and 0 < a < 1. The MGF is computed as

$$\begin{aligned}
M_X(t) &= \sum_{n=0}^{\infty}(1-a)a^ne^{tn} \\
&= (1-a)\sum_{n=0}^{\infty}(ae^t)^n \\
&= \frac{1-a}{1-ae^t}.
\end{aligned}$$

Then the mean μ is computed from $\mu = M_X'(0) = (1-a)(1-ae^t)^{-2}ae^t\,|_{t=0} = a/(1-a)$.


We make the observation that if all the moments exist and are known, then M(t) is
known as well (see Equations 4.5-4 and 4.5-2). Since $M_X(t)$ is related to $f_X(x)$ through the
Laplace transform, we can, in principle at least, determine $f_X(x)$ from its moments if they
exist.† In practice, if X is the RV whose pdf is desired and $X_i$ represents our ith observation
of X, then we can estimate the rth moment of X, $m_r$, from

$$\hat{m}_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r, \tag{4.5-11}$$

where $\hat{m}_r$ is called the r-moment estimator and is an RV, and n is the number of observations.
Even though $\hat{m}_r$ is an RV, its variance becomes small as n becomes large. So for
n large enough, we can have confidence that $\hat{m}_r$ is reasonably close to $m_r$ (a deterministic
quantity, that is, not an RV).

†For some distributions not all moments exist. For example, the Cauchy distribution has no finite
moments of order two or higher, and its mean is defined only as a principal value.

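The joint MGF $M_{XY}(t_1,t_2)$ of two RVs X and Y is defined by

$$\begin{aligned}
M_{XY}(t_1,t_2) &\triangleq E[e^{(t_1X+t_2Y)}] \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp(t_1x+t_2y)f_{XY}(x,y)\,dx\,dy.
\end{aligned} \tag{4.5-12}$$

Proceeding as we did in Equation 4.5-4, we can establish with the help of a power series
expansion that

$$M_{XY}(t_1,t_2) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{t_1^it_2^j}{i!\,j!}m_{ij} \tag{4.5-13}$$

where $m_{ij}$ is defined in Equation 4.3-13. Using the notation

$$M_{XY}^{(l,n)}(0,0) \triangleq \left.\frac{\partial^{l+n}M_{XY}(t_1,t_2)}{\partial t_1^l\,\partial t_2^n}\right|_{t_1=t_2=0},$$

we can show from Equation 4.5-12 or 4.5-13 that

$$m_{ln} = M_{XY}^{(l,n)}(0,0). \tag{4.5-14}$$

In particular

$$M_{XY}^{(1,0)}(0,0) = \mu_X, \qquad M_{XY}^{(0,1)}(0,0) = \mu_Y \tag{4.5-15}$$
$$M_{XY}^{(2,0)}(0,0) = E[X^2], \qquad M_{XY}^{(0,2)}(0,0) = E[Y^2] \tag{4.5-16}$$
$$M_{XY}^{(1,1)}(0,0) = m_{11} = \mathrm{Cov}[X,Y] + \mu_X\mu_Y. \tag{4.5-17}$$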


4.6 CHERNOFF BOUND 


The Chernoff bound furnishes an upper bound on the tail probability $P[X \ge a]$, where a is
some prescribed constant. First note that $u(x-a) \le e^{t(x-a)}$ for any t ≥ 0. Assume that X
is a continuous RV. Then

$$P[X \ge a] = \int_a^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(x)u(x-a)\,dx \tag{4.6-1}$$

and, by the observation made above, it follows that

$$P[X \ge a] \le \int_{-\infty}^{\infty} f_X(x)e^{t(x-a)}\,dx \tag{4.6-2}$$

and this must hold for any t ≥ 0. But, from Equation 4.5-2,

$$\int_{-\infty}^{\infty} f_X(x)e^{t(x-a)}\,dx = e^{-at}M_X(t), \tag{4.6-3}$$

where the subscript has been added to emphasize that the MGF is associated with X.
Combining Equations 4.6-3 and 4.6-2, we obtain

$$P[X \ge a] \le e^{-at}M_X(t). \tag{4.6-4}$$

The tightest bound, which occurs when the right-hand side is minimized with respect to t,
is called the Chernoff bound. We illustrate with some examples.


Example 4.6-1 
(Chernoff bound to Gaussian) Let $X : N(\mu, \sigma^2)$ and consider the Chernoff bound on
$P[X \ge a]$, where a > μ. From Equations 4.5-7 and 4.6-4 we obtain

$$P[X \ge a] \le e^{-(a-\mu)t+\sigma^2t^2/2}.$$

The minimum of the right-hand side is obtained by differentiating with respect to t and
occurs when $t = (a-\mu)/\sigma^2$. Hence the Chernoff bound is

$$P[X \ge a] \le e^{-(a-\mu)^2/2\sigma^2}. \tag{4.6-5}$$
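To get a feel for how tight Equation 4.6-5 is, the bound can be compared with the exact Gaussian tail. The sketch below is illustrative only; it computes the exact tail with the standard-library complementary error function `math.erfc`, and the function names are assumptions.

```python
import math


def gaussian_tail(a, mu=0.0, sigma=1.0):
    """Exact P[X >= a] for X : N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((a - mu) / (sigma * math.sqrt(2.0)))


def chernoff_gaussian(a, mu=0.0, sigma=1.0):
    """Chernoff bound exp(-(a - mu)^2 / (2 sigma^2)) of Equation 4.6-5, for a > mu."""
    return math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))


for a in (0.5, 1.0, 2.0, 3.0):
    print(f"a = {a:3.1f}  exact = {gaussian_tail(a):.5f}  bound = {chernoff_gaussian(a):.5f}")
```

The bound always lies above the exact tail and decays at the same exponential rate, although it overstates the probability by a factor that grows with a.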


The Chernoff bound can be derived for discrete RVs also. For example, assume that
an RV X takes values X = i, i = 0, 1, 2, ..., with probabilities $P[X = i] \triangleq P_X(i)$. For any
integers n, k, define

$$u(n-k) \triangleq \begin{cases} 1, & n \ge k, \\ 0, & \text{otherwise.} \end{cases}$$

It follows, therefore, that

$$\begin{aligned}
P[X \ge k] &= \sum_{n=k}^{\infty} P_X(n) \\
&= \sum_{n=0}^{\infty} P_X(n)u(n-k) \\
&\le \sum_{n=0}^{\infty} P_X(n)e^{t(n-k)} \quad \text{for } t \ge 0.
\end{aligned}$$

The last line follows from the fact that

$$e^{t(n-k)} \ge u(n-k) \quad \text{for } t \ge 0.$$

We note that

$$\begin{aligned}
\sum_{n=0}^{\infty} P_X(n)e^{t(n-k)} &= e^{-tk}\sum_{n=0}^{\infty} P_X(n)e^{tn} \\
&= e^{-tk}M_X(t) \quad \text{(by Equation 4.5-3).}
\end{aligned}$$

Hence we establish the result

$$P[X \ge k] \le e^{-tk}M_X(t). \tag{4.6-6}$$

As before, the Chernoff bound is determined by minimizing the right-hand side of Equation
4.6-6. We illustrate with an example.


Example 4.6-2 
(Chernoff bound for Poisson) Let X be a Poisson RV with parameter a > 0. Compute the
Chernoff bound for $P[X \ge k]$, where k > a. From homework problem 4.39 we find the MGF

$$M_X(t) = e^{a[e^t-1]}$$

and

$$e^{-tk}M_X(t) = e^{-a}e^{[ae^t-kt]}.$$

By setting

$$\frac{d}{dt}\left[e^{-tk}M_X(t)\right] = 0$$

we find that the minimum is reached when $t = t_m$, where

$$t_m = \ln\frac{k}{a}.$$

Thus with a = 2 and k = 5, we find

$$P[X \ge 5] \le e^{-2}\exp[5 - 5\ln(5/2)] \approx 0.21.$$
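The Poisson Chernoff bound can be evaluated directly and compared with the exact tail probability. The sketch below assumes the standard Poisson MGF $\exp[a(e^t - 1)]$ and the minimizing value $t_m = \ln(k/a)$; the function names are illustrative.

```python
import math


def poisson_mgf(t, a):
    """MGF of a Poisson RV with parameter a: exp(a (e^t - 1))."""
    return math.exp(a * (math.exp(t) - 1.0))


def chernoff_poisson(k, a):
    """Chernoff bound on P[X >= k] for k > a, evaluated at t_m = ln(k / a)."""
    t_m = math.log(k / a)
    return math.exp(-t_m * k) * poisson_mgf(t_m, a)


def poisson_tail(k, a):
    """Exact P[X >= k], computed as one minus the sum of the first k PMF terms."""
    return 1.0 - sum(math.exp(-a) * a ** n / math.factorial(n) for n in range(k))
```

For a = 2 and k = 5 the bound evaluates to about 0.206, while the exact tail probability is about 0.053, so the bound is valid but loose by roughly a factor of four in this case.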


4.7 CHARACTERISTIC FUNCTIONS 


If in Equation 4.5-1 we replace the parameter t by jω, where $j = \sqrt{-1}$, we obtain the
characteristic function (CF) of X defined by

$$\Phi_X(\omega) \triangleq E[e^{j\omega X}] = \int_{-\infty}^{\infty} f_X(x)e^{j\omega x}\,dx, \tag{4.7-1}$$

which, except for a minus sign difference in the exponent, we recognize as the Fourier
transform of $f_X(x)$. For discrete RVs we can define $\Phi_X(\omega)$ in terms of the PMF by

$$\Phi_X(\omega) = \sum_i e^{j\omega x_i}P_X(x_i). \tag{4.7-2}$$


For our purposes, the CF has all the properties of the MGF. The Fourier transform is widely 
used in statistical communication theory, and since the inversion of Equation 4.7-1 is often 
easy to achieve, either by direct integration or through the availability of extensive tables 
of Fourier transforms (e.g., [4-7]), the CF is widely used to solve problems involving the 
computation of the sums of independent RVs. We have seen that the pdf of the sum of
independent RVs involves the convolution of their pdf's. Thus if $Z = X_1 + \cdots + X_N$, where
$X_i$, i = 1, ..., N, are independent RVs, the pdf of Z is furnished by

$$f_Z(z) = f_{X_1}(z) * f_{X_2}(z) * \cdots * f_{X_N}(z); \tag{4.7-3}$$


that is, the repeated convolution product. 

The actual evaluation of Equation 4.7-3 can be tedious. However, we know from our 
studies of Fourier transforms that the Fourier transform of a convolution product is the 
product of the individual transforms. We illustrate the use of CFs in the following examples. 


Example 4.7-1 
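(CF of sum) Let $Z \triangleq X_1 + X_2$, where $X_1$ and $X_2$ are independent, with $f_{X_1}(x)$, $f_{X_2}(x)$,
and $f_Z(z)$ denoting the pdf's of $X_1$, $X_2$, and Z, respectively. Show that
$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega)$.


Solution From the main result of Section 3.3 (Equation 3.3-15), we have

$$f_Z(z) = \int_{-\infty}^{\infty} f_{X_1}(x)f_{X_2}(z-x)\,dx$$

and the corresponding CF

$$\begin{aligned}
\Phi_Z(\omega) &= \int_{-\infty}^{\infty} e^{j\omega z}\left[\int_{-\infty}^{\infty} f_{X_1}(x)f_{X_2}(z-x)\,dx\right]dz \\
&= \int_{-\infty}^{\infty} f_{X_1}(x)\left[\int_{-\infty}^{\infty} f_{X_2}(z-x)e^{j\omega z}\,dz\right]dx.
\end{aligned}$$

With a change of variable $\alpha \triangleq z - x$, we obtain the CF of the sum Z as

$$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega).$$

This result can be extended to N RVs by induction. Thus if $Z = X_1 + \cdots + X_N$, then the
CF of Z would be

$$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega)\cdots\Phi_{X_N}(\omega).$$

Example 4.7-2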
(CF of i.i.d. sum) Let $X_i$, i = 1, ..., N, be a sequence of i.i.d. RVs with $X_i : N(0,1)$. Compute
the pdf of

$$Z \triangleq \sum_{i=1}^{N} X_i.$$

Solution The pdf of Z can be computed by Equation 4.7-3. On the other hand, with
$\Phi_{X_i}(\omega)$ denoting the CF of $X_i$, we have

$$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\times\cdots\times\Phi_{X_N}(\omega). \tag{4.7-4}$$

However, since the $X_i$'s are i.i.d. N(0,1), the CFs of all the $X_i$'s are the same, and we define
$\Phi_X(\omega) \triangleq \Phi_{X_1}(\omega) = \cdots = \Phi_{X_N}(\omega)$. Thus,

$$\Phi_X(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2}e^{j\omega x}\,dx. \tag{4.7-5}$$

By completing the squares in the exponent, we obtain

$$\begin{aligned}
\Phi_X(\omega) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}[x^2-2j\omega x+(j\omega)^2-(j\omega)^2]}\,dx \\
&= e^{-\frac{\omega^2}{2}}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(x-j\omega)^2}\,dx.
\end{aligned}$$

But the integral can be regarded as the area under a "Gaussian pdf" with "mean" jω and
hence its value is unity.† Thus we obtain the CF of X as

$$\Phi_X(\omega) = e^{-\frac{\omega^2}{2}}$$

and so the CF of Z is

$$\Phi_Z(\omega) = [\Phi_X(\omega)]^N = e^{-\frac{1}{2}N\omega^2}. \tag{4.7-6}$$

From the form of $\Phi_Z(\omega)$ we deduce that $f_Z(z)$ must also be Gaussian. To obtain $f_Z(z)$ we
use the Fourier inversion formula:

$$f_Z(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Phi_Z(\omega)e^{-j\omega z}\,d\omega. \tag{4.7-7}$$

Inserting Equation 4.7-6 into Equation 4.7-7 and manipulating terms enables us to obtain

$$f_Z(z) = \frac{1}{\sqrt{2\pi N}}\,e^{-\frac{1}{2}(z^2/N)}.$$

Hence $f_Z(z)$ is indeed Gaussian. The variance of Z is N, and its mean is zero.


†While this result is not obvious, it can be rigorously demonstrated using integration in the complex
plane.

Example 4.7-3
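(CF of sum of uniform RVs) Consider two independent RVs X and Y with common pdf

$$f_X(x) = f_Y(x) = \frac{1}{a}\,\mathrm{rect}\left(\frac{x}{a}\right).$$

Compute the pdf of $Z \triangleq X + Y$ using CFs.


Solution We can, of course, compute $f_Z(z)$ by convolving $f_X(x)$ and $f_Y(y)$. However,
using CFs, we obtain $f_Z(z)$ from

$$f_Z(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Phi_X(\omega)\Phi_Y(\omega)e^{-j\omega z}\,d\omega,$$

where

$$\Phi_X(\omega)\Phi_Y(\omega) = \Phi_Z(\omega).$$

Since the pdf's of X and Y are the same, we can write

$$\begin{aligned}
\Phi(\omega) &\triangleq \Phi_X(\omega) = \Phi_Y(\omega) \\
&= \frac{1}{a}\int_{-a/2}^{a/2} e^{j\omega x}\,dx \\
&= \frac{\sin(a\omega/2)}{a\omega/2}.
\end{aligned}$$

Hence

$$\Phi_Z(\omega) = \left(\frac{\sin(a\omega/2)}{a\omega/2}\right)^2 \tag{4.7-8}$$

and

$$f_Z(z) = \frac{1}{a}\left(1-\frac{|z|}{a}\right)\mathrm{rect}\left(\frac{z}{2a}\right), \tag{4.7-9}$$

which is shown in Figure 4.7-1. The easiest way to obtain the result in Equation 4.7-9 is
to look up the Fourier transform (or its inverse) of Equation 4.7-8 in a table of elementary
Fourier transforms.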
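Instead of consulting a transform table, the pair in Equations 4.7-8 and 4.7-9 can be checked numerically. The sketch below is an illustration under assumptions: it evaluates the CF integral of Equation 4.7-1 with a midpoint rule, and the names `cf_numeric`, `uniform_pdf`, `triangle_pdf`, and `sinc_cf` as well as the step count are choices made for the example.

```python
import cmath
import math


def cf_numeric(pdf, lo, hi, omega, steps=4000):
    """Midpoint-rule estimate of the integral of pdf(x) exp(j omega x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0j
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += pdf(x) * cmath.exp(1j * omega * x)
    return total * h


def uniform_pdf(x, a):
    """(1/a) rect(x/a): uniform density on (-a/2, a/2)."""
    return 1.0 / a if abs(x) < a / 2.0 else 0.0


def triangle_pdf(z, a):
    """(1/a)(1 - |z|/a) on (-a, a): the density in Equation 4.7-9."""
    return (1.0 - abs(z) / a) / a if abs(z) < a else 0.0


def sinc_cf(omega, a):
    """sin(a omega / 2) / (a omega / 2), with the limiting value 1 at omega = 0."""
    u = a * omega / 2.0
    return 1.0 if u == 0.0 else math.sin(u) / u
```

With a = 2, the numerically computed CF of `uniform_pdf` matches `sinc_cf`, and the CF of `triangle_pdf` matches its square, which is consistent with the triangular pdf being the density of the sum.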


As in the case of MGFs, we can compute the moments from the CFs by differentiation,
provided that these exist. If we expand exp(jωX) into a power series and take the
expectation, we obtain

$$\Phi_X(\omega) = E[e^{j\omega X}] = \sum_{n=0}^{\infty}\frac{(j\omega)^n}{n!}m_n. \tag{4.7-10}$$

From Equation 4.7-10 it is easily established that

$$m_n = \frac{1}{j^n}\Phi_X^{(n)}(0), \tag{4.7-11}$$

where we have used the notation

$$\Phi_X^{(n)}(0) \triangleq \left.\frac{d^n}{d\omega^n}\Phi_X(\omega)\right|_{\omega=0}.$$

Figure 4.7-1 The pdf of Z = X + Y when X and Y are independently, identically, and uniformly
distributed in (−a/2, a/2).


Example 4.7-4 
(moment calculation) Compute the first few moments of Y = sin Θ if Θ : U[0, 2π].


Solution We use the result in Equation 4.1-9; that is, if Y = g(X), then

$$\mu_Y = \int_{-\infty}^{\infty} yf_Y(y)\,dy = \int_{-\infty}^{\infty} g(x)f_X(x)\,dx.$$

Hence

$$\begin{aligned}
E[e^{j\omega Y}] &= \int_{-\infty}^{\infty} e^{j\omega y}f_Y(y)\,dy \\
&= \frac{1}{2\pi}\int_0^{2\pi} e^{j\omega\sin\theta}\,d\theta \\
&= J_0(\omega),
\end{aligned}$$

where $J_0(\omega)$ is the Bessel function of the first kind of order zero. A power series expansion
of $J_0(\omega)$ gives

$$J_0(\omega) = 1 - \left(\frac{\omega}{2}\right)^2 + \frac{1}{2!\,2!}\left(\frac{\omega}{2}\right)^4 - \cdots.$$

Hence all the odd-order moments are zero. From Equation 4.7-11 we compute

$$E[Y^2] = m_2 = (-1)\Phi_Y^{(2)}(0) = \frac{1}{2}$$

$$E[Y^4] = m_4 = (+1)\Phi_Y^{(4)}(0) = \frac{3}{8}.$$
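The moments of Y = sin Θ can also be obtained by direct numerical averaging over one period, which gives an independent check on the values read off the Bessel series. This is a small sketch; the function name and the number of integration steps are assumptions.

```python
import math


def moment_sin(order, steps=10000):
    """Midpoint-rule estimate of E[sin(Theta)^order] for Theta uniform on [0, 2 pi]."""
    h = 2.0 * math.pi / steps
    total = sum(math.sin((i + 0.5) * h) ** order for i in range(steps))
    return total * h / (2.0 * math.pi)
```

The estimates for orders 1 and 3 are zero to rounding error, while orders 2 and 4 return values that agree with 1/2 and 3/8.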
Example 4.7-5
(sum of independent binomials) Let X and Y be i.i.d. binomial RVs with parameters n and
p, that is,

$$P_X(k) = P_Y(k) = \binom{n}{k}p^kq^{n-k}.$$

Compute the PMF of Z = X + Y.


Solution Since X and Y take on nonnegative integer values, so must Z. We can solve
this problem by (1) convolution of the pdf's, which involves delta functions; (2) discrete
convolution of the PMFs; and (3) CFs. The discrete convolution for this case is

$$P_Z(k) = \sum_i P_X(i)P_Y(k-i) = p^kq^{2n-k}\sum_{i=\max(0,k-n)}^{\min(n,k)}\binom{n}{i}\binom{n}{k-i}, \quad \text{for } k = 0, 1, \ldots, 2n.$$

The trouble is that we may not immediately recognize the closed form of the sum of products
of binomial coefficients.† The computation of the PMF of Z by CFs is very simple. First
observe that

$$\Phi_X(\omega) = \Phi_Y(\omega) = (pe^{j\omega}+q)^n.$$

Thus, by virtue of the independence of X and Y, we obtain the CF

$$\begin{aligned}
\Phi_Z(\omega) &= E[\exp j\omega(X+Y)] \\
&= E[\exp(j\omega X)]\,E[\exp(j\omega Y)] \\
&= (pe^{j\omega}+q)^{2n}.
\end{aligned}$$

Thus Z is binomial with parameters 2n and p, that is,

$$P_Z(k) = \binom{2n}{k}p^kq^{2n-k}, \quad \text{for } k = 0, \ldots, 2n.$$

†Recall that we ran into this problem in Example 3.3-9 in Chapter 3.

As a by-product of the computation of $P_Z(k)$ by CFs, we obtain the result that

$$\sum_{i=\max(0,k-n)}^{\min(n,k)}\binom{n}{i}\binom{n}{k-i} = \binom{2n}{k}.$$

An extension of this result is the following: If $X_1, X_2, \ldots, X_N$ are i.i.d. binomials with
parameters n, p, then $Z = \sum_{i=1}^{N} X_i$ is binomial with parameters Nn, p. Regardless of how
large N gets, Z remains a discrete RV with a binomial PMF.†
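Both the convolution result and the binomial-coefficient identity can be confirmed exactly with rational arithmetic. The sketch below uses `math.comb` (Python 3.8 or later) and `fractions.Fraction`; the helper names and the specific values n = 4, p = 3/10 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """List of binomial probabilities P[k], k = 0..n, for success probability p."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]


def convolve(a, b):
    """Discrete convolution of two PMFs given as lists indexed from 0."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def coefficient_sum(n, k):
    """Sum over i of C(n, i) C(n, k - i); comb returns 0 when k - i exceeds n."""
    return sum(comb(n, i) * comb(n, k - i) for i in range(0, min(n, k) + 1))


p = Fraction(3, 10)
pmf_sum = convolve(binomial_pmf(4, p), binomial_pmf(4, p))
```

Here `pmf_sum` equals `binomial_pmf(8, p)` term by term, and `coefficient_sum(4, k)` equals `comb(8, k)` for every k from 0 to 8.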


Example 4.7-6 
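(variance of Poisson) Here we calculate the CF of a Poisson RV and use it to determine
the variance. Let the RV K be Poisson distributed with PMF

$$P_K(k) = \frac{a^k}{k!}e^{-a}u(k), \quad a > 0.$$

Then the CF is given as

$$\begin{aligned}
\Phi_K(\omega) &= \sum_{k=0}^{\infty}\frac{a^k}{k!}e^{-a}e^{j\omega k} \\
&= e^{-a}\sum_{k=0}^{\infty}\frac{(ae^{j\omega})^k}{k!} \\
&= \exp[a(e^{j\omega}-1)].
\end{aligned}$$

Now $m_2 = E[K^2] = \frac{1}{j^2}\Phi_K^{(2)}(0) = -\Phi_K^{(2)}(0)$. Taking the indicated derivatives, we get

$$\Phi_K^{(1)}(\omega) = \Phi_K(\omega)\,aje^{j\omega}$$

and

$$\begin{aligned}
\Phi_K^{(2)}(\omega) &= \Phi_K(\omega)\,aj^2e^{j\omega} + \Phi_K^{(1)}(\omega)\,aje^{j\omega} \\
&= -\Phi_K(\omega)\,ae^{j\omega} + \Phi_K(\omega)(aje^{j\omega})^2.
\end{aligned}$$

So $\Phi_K^{(2)}(0) = -1\times a - 1\times a^2$. Hence $m_2 = a + a^2$. Then since the mean is μ = a, the
variance must be

$$\sigma^2 = m_2 - \mu^2 = a.$$

The variance of the Poisson RV thus equals its mean value.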
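The moment relation $m_n = \Phi^{(n)}(0)/j^n$ can be tried out on the Poisson CF with finite differences. The following is a sketch, assuming the closed form $\exp[a(e^{j\omega} - 1)]$ for the CF; the step size and helper names are illustrative, and the derivatives are only approximated.

```python
import cmath


def poisson_cf(omega, a):
    """CF of a Poisson RV with parameter a: exp(a (e^{j omega} - 1))."""
    return cmath.exp(a * (cmath.exp(1j * omega) - 1.0))


def first_moment_from_cf(cf, h=1e-3):
    """m_1 = Phi'(0) / j, with Phi'(0) estimated by a central difference."""
    d1 = (cf(h) - cf(-h)) / (2.0 * h)
    return (d1 / 1j).real


def second_moment_from_cf(cf, h=1e-3):
    """m_2 = Phi''(0) / j^2 = -Phi''(0), with Phi''(0) estimated by a central difference."""
    d2 = (cf(h) - 2.0 * cf(0.0) + cf(-h)) / h ** 2
    return (-d2).real


a = 2.5
m1 = first_moment_from_cf(lambda w: poisson_cf(w, a))
m2 = second_moment_from_cf(lambda w: poisson_cf(w, a))
variance = m2 - m1 ** 2
```

For a = 2.5 the estimated mean and variance both come out close to 2.5.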


†Recall this statement for future reference in connection with the Central Limit Theorem.

Note that since the variance of the Poisson RV equals its mean, the standard deviation
is the square root of the mean. Therefore for large mean values, the distribution becomes
relatively concentrated around the mean. Another point is that unlike the Normal distribution
the mean and variance of the Poisson RV are not independent parameters, i.e., they
cannot be freely chosen.


Example 4.7-7 
(a fair game?) A lottery game called “three players for six hits” is played as follows. A bettor 
bets the bank that three baseball players of the bettor’s choosing will get a combined total 
of six hits or more in the games in which they play. Many combinations can lead to a win; 
for example, player A can go hitless in his game, but player B can collect three hits in his 
game, and player C can collect three hits in his game. The players can be on the same team 
or on different teams. The bet is at even odds and the bettor receives back $2 on a bet of $1 
in case of a win. Is this a “fair” game, that is, is the probability of a win close to one-half? 


Solution Let $X_1$, $X_2$, $X_3$ denote the number of hits by players A, B, C, respectively.
Clearly $X_1$, $X_2$, $X_3$ are individually binomial. The total number of hits is $Y = \sum_{i=1}^{3} X_i$.
We wish to compute P[Y ≥ 6]. To simplify the problem, assume that the three hit counts are
independent, that each player bats five times per game, and that their batting averages are
the same, say .300 (for those unfamiliar with baseball nomenclature, this means that the
probability of getting a hit while batting is 0.3).
Then from the results of Example 4.7-5, we find Y is binomial with parameters n = 15,
p = 0.3. Thus,

$$\begin{aligned}
P[Y \ge 6] &= \sum_{i=6}^{15}\binom{15}{i}(0.3)^i(0.7)^{15-i} \\
&\approx \mathrm{erf}(6.76) - \mathrm{erf}(0.56) \\
&= 0.29.
\end{aligned}$$

In arriving at this result, we used the Normal approximation to the binomial as suggested
in Chapter 1, Section 1.11. The bettor has less than a one-third chance of winning. Despite
the poor odds, the game can be modified to be fairer to the bettor. Define the RV G as the
gain to the bettor and define a fair game as one in which the expected gain is zero. Then
if the bettor were to receive winnings of $2.45 per play instead of $1, we would find that
E[G] = $2.45 × 0.29 − $1 × 0.71 ≈ 0. Of course if E[G] > 0, then in a sense, the game favors
the bettor. Some people play the state lottery using this criterion.
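Since n = 15 is small, the binomial tail can also be summed exactly rather than approximated. The sketch below is illustrative; `binomial_tail` and `fair_winnings` are hypothetical helper names, and the fair payout is the value of the winnings that makes the expected gain zero for a given win probability.

```python
from math import comb


def binomial_tail(n, p, k_min):
    """Exact P[Y >= k_min] for a binomial RV with parameters n and p."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k_min, n + 1))


def fair_winnings(p_win, stake=1.0):
    """Winnings w solving w * p_win - stake * (1 - p_win) = 0."""
    return stake * (1.0 - p_win) / p_win


p_exact = binomial_tail(15, 0.3, 6)
```

The exact sum gives a win probability of about 0.278, slightly below the 0.29 obtained from the Normal approximation; `fair_winnings(0.29)` is about 2.45, whereas the exact probability would call for winnings of roughly 2.59.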


Joint Characteristic Functions 


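As in the case of joint MGFs we can define the joint CF by

$$\Phi_{X_1\ldots X_N}(\omega_1,\omega_2,\ldots,\omega_N) = E\left[\exp\left(j\sum_{i=1}^{N}\omega_iX_i\right)\right]. \tag{4.7-12}$$

By the Fourier inversion property, the joint pdf is the inverse Fourier transform (with a sign
reversal) of $\Phi_{X_1\ldots X_N}(\omega_1,\ldots,\omega_N)$. Thus,

$$f_{X_1\ldots X_N}(x_1,\ldots,x_N) = \frac{1}{(2\pi)^N}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\Phi_{X_1\ldots X_N}(\omega_1,\ldots,\omega_N)\exp\left(-j\sum_{i=1}^{N}\omega_ix_i\right)d\omega_1\,d\omega_2\ldots d\omega_N. \tag{4.7-13}$$

We can obtain the moments by differentiation. For instance, with X, Y denoting any two
RVs (N = 2) we have

$$m_{rk} = E[X^rY^k] = (-j)^{r+k}\Phi_{XY}^{(r,k)}(0,0), \tag{4.7-14}$$

where

$$\Phi_{XY}^{(r,k)}(0,0) \triangleq \left.\frac{\partial^{r+k}\Phi_{XY}(\omega_1,\omega_2)}{\partial\omega_1^r\,\partial\omega_2^k}\right|_{\omega_1=\omega_2=0}. \tag{4.7-15}$$

Finally for discrete RVs we can define the joint CF in terms of the joint PMF. For instance
for two RVs, X and Y, we obtain

$$\Phi_{XY}(\omega_1,\omega_2) = \sum_i\sum_j e^{j(\omega_1x_i+\omega_2y_j)}P_{XY}(x_i,y_j). \tag{4.7-16}$$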


Example 4.7-8 
(joint CF of 1.i.d. Normal RVs) Compute the joint characteristic function of X and Y if 


fxy = > exp | (x? 4 | . 
Tv 


Solution Applying the definition in Equation 4.7-12, we get 


Pxy(wi,w2) = =|. [ie 4 (a+?) .jwietjwoy Sea: 


Completing the squares in both x and y, we get 


2) dx 


co 
Oxy (w1,wW2) = eee | oe” 2[e?—2jure+ (jury 
_ V20 


. / So Bly? —2iwayt (dw)? FY 
= Jor 


= en 2 Wits) a en 2 (t—jui)? a i e7 2 (y—jwe)? dy 
= /27 J—oo V2 


aft 2 2 
— p-ax(w{tw 
=e 3 ( al 2) 


>) 


since the integrals are the areas under unit-variance Gaussian curves. 
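Example 4.7-9 
(joint CF of two discrete RVs) Compute the joint CF of the discrete RVs X and Y if the 
joint PMF is 

$$P_{XY}(k, l) = \begin{cases} \frac{1}{3}, & k = l = 0, \\ \frac{1}{6}, & k = \pm 1,\ l = 0, \\ \frac{1}{6}, & k = l = \pm 1, \\ 0, & \text{else.} \end{cases}$$

Solution Using Equation 4.7-16 we obtain 

$$\Phi_{XY}(\omega_1, \omega_2) = \sum_{k=-1}^{1} \sum_{l=-1}^{1} e^{j(\omega_1 k + \omega_2 l)} P_{XY}(k, l) = \frac{1}{3} + \frac{1}{3} \cos \omega_1 + \frac{1}{3} \cos(\omega_1 + \omega_2).$$

From Equations 4.7-14 and 4.7-15 we obtain, since $\mu_X = \mu_Y = 0$, 

$$\sigma_X^2 = m_{20} = \left. -(-j)^2\, \tfrac{1}{3}\left[\cos \omega_1 + \cos(\omega_1 + \omega_2)\right] \right|_{\omega_1 = \omega_2 = 0} = \frac{2}{3},$$

$$\sigma_Y^2 = m_{02} = \left. -(-j)^2\, \tfrac{1}{3} \cos(\omega_1 + \omega_2) \right|_{\omega_1 = \omega_2 = 0} = \frac{1}{3},$$

$$m_{11} = \left. -(-j)^2\, \tfrac{1}{3} \cos(\omega_1 + \omega_2) \right|_{\omega_1 = \omega_2 = 0} = \frac{1}{3}.$$

Hence the correlation coefficient ρ is computed to be 

$$\rho = \frac{m_{11}}{\sigma_X \sigma_Y} = \frac{1/3}{\sqrt{2/3}\,\sqrt{1/3}} = \frac{1}{\sqrt{2}} \approx 0.707.$$

Example 4.7-10 
(joint CF of correlated Normal RVs) As another example we compute the joint CF of X 
and Y with 

$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left(-\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)}\right).$$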
To solve this problem we use two facts: 

(1) A zero-mean Gaussian RV Z with variance $\sigma_Z^2$ has CF 

$$E[e^{j\omega Z}] = \exp\left[-\tfrac{1}{2}\sigma_Z^2 \omega^2\right] \tag{4.7-17}$$

and, in particular, with ω = 1, 

$$E[e^{jZ}] = \exp\left[-\tfrac{1}{2}\sigma_Z^2\right]. \tag{4.7-18}$$

Proof of fact (1) Use the definition of the CF with $f_Z(z) = (2\pi\sigma_Z^2)^{-1/2} \exp\left[-\tfrac{1}{2} z^2/\sigma_Z^2\right]$ 
and apply the complete-the-square technique described in Appendix A. 

(2) If X and Y are zero-mean jointly Gaussian RVs, then for any real $\omega_1$, $\omega_2$, the RVs 

$$Z \triangleq \omega_1 X + \omega_2 Y, \qquad W \triangleq X$$

are jointly Gaussian and, as a direct by-product, the marginal density of Z is Gaussian. 

Proof of fact (2) Simply use Equation 3.4-11 or 3.4-12 to compute $f_{ZW}(z, w)$. One 
easily finds that Z, W are jointly Gaussian and that, therefore, the marginal pdf of Z alone 
is Gaussian with $\mu_Z = 0$. The variance of Z is computed as 

$$\text{Var}(Z) = E[(\omega_1 X + \omega_2 Y)^2] = \omega_1^2\, \text{Var}[X] + \omega_2^2\, \text{Var}[Y] + 2\omega_1 \omega_2\, E[XY].$$

With $\sigma_X^2 = \sigma_Y^2 \triangleq 1$, we obtain $\sigma_Z^2 = \omega_1^2 + \omega_2^2 + 2\omega_1 \omega_2 \rho$. 
Finally, recalling that $Z = \omega_1 X + \omega_2 Y$ and using Equation 4.7-18, we write 

$$E[e^{j(\omega_1 X + \omega_2 Y)}] = e^{-\frac{1}{2}(\omega_1^2 + \omega_2^2 + 2\omega_1 \omega_2 \rho)}. \tag{4.7-19}$$

Equation 4.7-19 is the joint CF of two zero-mean, unity variance correlated Gaussian RVs. 
When ρ = 0, the RVs become uncorrelated and therefore independent and we obtain the 
result in Example 4.7-8. 
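
The closed form for the joint CF of two zero-mean, unit-variance correlated Gaussian RVs can be compared against a simulation. The sketch below is an illustration only: the function names, the sample size, the seed, and the test frequencies are assumptions, and the correlated pair is generated by the standard construction Y = ρU + sqrt(1 − ρ²)V from independent standard Normals U and V.

```python
import math
import random

def sample_correlated_normals(rho, rng):
    """Draw (X, Y): zero-mean, unit-variance, jointly Gaussian with correlation rho."""
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(0.0, 1.0)
    return u, rho * u + math.sqrt(1.0 - rho * rho) * v

def empirical_cf(w1, w2, rho, n=50_000, seed=0):
    """Monte Carlo estimate of E[exp(j(w1*X + w2*Y))]."""
    rng = random.Random(seed)
    re = im = 0.0
    for _ in range(n):
        x, y = sample_correlated_normals(rho, rng)
        t = w1 * x + w2 * y
        re += math.cos(t)
        im += math.sin(t)
    return complex(re / n, im / n)

def theoretical_cf(w1, w2, rho):
    """exp(-(w1^2 + w2^2 + 2*w1*w2*rho)/2), the closed form for unit variances."""
    return math.exp(-0.5 * (w1 * w1 + w2 * w2 + 2.0 * w1 * w2 * rho))

print(empirical_cf(0.7, -0.4, 0.5), theoretical_cf(0.7, -0.4, 0.5))
```

The imaginary part of the estimate should hover near zero, since the closed-form CF is real for zero-mean Gaussian RVs.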

The extension to more than two discrete RVs is straightforward, although the notation 
becomes a little clumsy, unless matrices are introduced. 


The Central Limit Theorem 


It is sometimes said that the sum of a large number of RVs tends toward the Normal. Under 
what conditions is this true? The Central Limit Theorem deals with this important point. 

Basically the Central Limit Theorem† says that the normalized sum of a large number 
of mutually independent RVs $X_1, \ldots, X_n$ with zero means and finite variances $\sigma_1^2, \ldots, \sigma_n^2$ 
tends to the Normal CDF provided that the individual variances $\sigma_k^2$, k = 1, ..., n, are 
small compared to $\sum_{i=1}^{n} \sigma_i^2$. The constraints on the variances are known as the Lindeberg 
conditions and are discussed in detail by Feller [4-1, p. 262]. We state a general form of the 
Central Limit Theorem in the following and furnish a proof for a special case. 


Theorem 4.7-1 Let $X_1, \ldots, X_n$ be n mutually independent (scalar) RVs with CDFs 
$F_{X_1}(x_1), F_{X_2}(x_2), \ldots, F_{X_n}(x_n)$, respectively, such that 

$$\mu_{X_k} = 0, \qquad \text{Var}[X_k] = \sigma_k^2,$$

and let 

$$s_n^2 \triangleq \sigma_1^2 + \cdots + \sigma_n^2.$$

If for a given ε > 0 and n sufficiently large the $\sigma_k$ satisfy 

$$\sigma_k < \varepsilon s_n, \qquad k = 1, \ldots, n,$$

then the normalized sum 

$$Z_n \triangleq (X_1 + \cdots + X_n)/s_n$$

converges to the standard Normal CDF, denoted by 1/2 + erf(z), that is, 
$\lim_{n \to \infty} F_{Z_n}(z) = 1/2 + \text{erf}(z)$. This is called convergence in distribution. 


A discussion of convergence in distribution is given later in this section. 
We now prove a special case of the foregoing. 


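Theorem 4.7-2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. RVs with $\mu_{X_i} = 0$ and $\text{Var}[X_i] = 1$, 
i = 1, ..., n. Then 

$$Z_n \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$$

tends to the Normal in the sense that its CF $\Phi_{Z_n}$ satisfies 

$$\lim_{n \to \infty} \Phi_{Z_n}(\omega) = e^{-\frac{1}{2}\omega^2},$$

which is the CF of the N(0,1) RV. 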


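Proof Let $W_i \triangleq X_i/\sqrt{n}$. Also, let $\Phi_{X_i}(\omega)$ and $f_{X_i}(x)$ be the CF and pdf of $X_i$, 
respectively. Then 

$$\Phi_{W_i}(\omega) \triangleq E[e^{j\omega W_i}] = E[e^{j(\omega/\sqrt{n})X_i}] = \Phi_{X_i}\!\left(\frac{\omega}{\sqrt{n}}\right).$$

†First proved by Abraham De Moivre in 1733 for the special case of Bernoulli RVs. A more general 
proof was furnished by J. W. Lindeberg in Mathematische Zeitschrift, vol. 15 (1922), pp. 211-225. 

Since $\Phi_{X_i}(\omega)$ and $\Phi_{W_i}(\omega)$ do not depend on i, we write $\Phi_{X_i}(\omega) \triangleq \Phi_X(\omega)$ and $\Phi_{W_i}(\omega) \triangleq \Phi_W(\omega)$. 
From calculus we know that any function Φ(ω) whose derivatives exist in a neighborhood 
about $\omega_0$ can be represented by a Taylor series 

$$\Phi(\omega) = \sum_{l=0}^{\infty} \frac{1}{l!}\, \Phi^{(l)}(\omega_0)(\omega - \omega_0)^l,$$

where $\Phi^{(l)}(\omega_0)$ is the lth derivative of Φ(ω) at $\omega_0$. Moreover, if the derivatives are continuous 
in the interval $[\omega_0, \omega]$, Φ(ω) can be expressed as a finite Taylor series plus a remainder $A_L(\omega)$, 
that is, 

$$\Phi(\omega) = \sum_{l=0}^{L-1} \frac{1}{l!}\, \Phi^{(l)}(\omega_0)(\omega - \omega_0)^l + A_L(\omega),$$

where 

$$A_L(\omega) \triangleq \frac{1}{L!}\, \Phi^{(L)}(\xi)(\omega - \omega_0)^L$$

and ξ is some point in the interval $[\omega_0, \omega]$. Let us apply this result to $\Phi_W(\omega)$ with $\omega_0 = 0$. 
Then 

$$\begin{aligned}
\Phi_W(0) &= 1, \\
\Phi_W^{(1)}(0) &= \left[\frac{j}{\sqrt{n}} \int_{-\infty}^{\infty} x\, e^{j\omega x/\sqrt{n}} f_X(x)\, dx\right]_{\omega = 0} = 0, \\
\Phi_W^{(2)}(0) &= \left[\left(\frac{j}{\sqrt{n}}\right)^2 \int_{-\infty}^{\infty} x^2 e^{j\omega x/\sqrt{n}} f_X(x)\, dx\right]_{\omega = 0} = -\frac{1}{n}.
\end{aligned}$$

Hence 

$$\Phi_W(\omega) = 1 - \frac{\omega^2}{2n} + \frac{R_3'\, \omega^3}{n\sqrt{n}},$$

where 

$$R_3' \triangleq -j \int_{-\infty}^{\infty} x^3 e^{j\xi x/\sqrt{n}} f_X(x)\, dx \,/\, 6.$$

Since $Z_n = \sum_{i=1}^{n} W_i$ and the $W_i$ are independent, we obtain 

$$\Phi_{Z_n}(\omega) = [\Phi_W(\omega)]^n$$

or 

$$\ln \Phi_{Z_n}(\omega) = n \ln \Phi_W(\omega).$$

Now recall that for any h such that |h| < 1, 

$$\ln(1 + h) = h - \frac{h^2}{2} + \frac{h^3}{3} - \cdots.$$

For any fixed ω we can choose n large enough that $h \triangleq -\omega^2/2n + R_3'\omega^3/(n\sqrt{n})$ satisfies |h| < 1. 
Assuming this to have been done, we can write 

$$\begin{aligned}
\ln \Phi_{Z_n}(\omega) &= n \ln\left[1 - \frac{\omega^2}{2n} + \frac{R_3'\, \omega^3}{n\sqrt{n}}\right] \\
&= n\left[-\frac{\omega^2}{2n} + \frac{R_3'\, \omega^3}{n\sqrt{n}}\right] - \frac{n}{2}\left[-\frac{\omega^2}{2n} + \frac{R_3'\, \omega^3}{n\sqrt{n}}\right]^2 + \frac{n}{3}\left[-\frac{\omega^2}{2n} + \frac{R_3'\, \omega^3}{n\sqrt{n}}\right]^3 - \cdots \\
&= -\frac{\omega^2}{2} + \text{terms involving factors of } n^{-1/2},\ n^{-1},\ n^{-3/2}, \ldots.
\end{aligned}$$

Hence 

$$\lim_{n \to \infty} [\ln \Phi_{Z_n}(\omega)] = -\frac{\omega^2}{2}$$

or, equivalently, 

$$\lim_{n \to \infty} \Phi_{Z_n}(\omega) = e^{-\omega^2/2},$$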

which is the CF of the N(0,1) RV. Note that to argue that $\lim_{n \to \infty} f_{Z_n}(z)$ is the normal 
pdf we should have to argue that 

$$\lim_{n \to \infty} \Phi_{Z_n}(\omega) \triangleq \lim_{n \to \infty} \left(\int_{-\infty}^{\infty} f_{Z_n}(z)\, e^{j\omega z}\, dz\right) = \int_{-\infty}^{\infty} \left(\lim_{n \to \infty} f_{Z_n}(z)\right) e^{j\omega z}\, dz.$$

However, the operations of limiting and integrating are not always interchangeable. Hence 
we cannot say that the pdf of $Z_n$ converges to N(0,1). Indeed we already know from 
Example 4.7-5 that the sum of n i.i.d. binomial RVs is binomial regardless of how large n 
is; moreover, the binomial PMF or pdf is a discontinuous function while the Gaussian is 
continuous and no matter how large n is, this fact cannot be altered. However, the integrals 
of the binomial pdf, for large n, behave like integrals of the Gaussian pdf. This is why the 
distribution function of $Z_n$ tends to a Gaussian distribution function but not necessarily to 
a Gaussian pdf. 

The astute reader will have noticed that in the prior development we showed the normal 
convergence of the CF but not as yet the normal convergence of the CDF. To prove the 
latter true we can use a continuity theorem† which states the following: Consider a sequence 
of RVs $Z_i$, i = 1, ..., n, with CFs and CDFs $\Phi_i(\omega)$ and $F_i(z)$, i = 1, ..., n, respectively, with 
$\Phi(\omega) \triangleq \lim_{n \to \infty} \Phi_n(\omega)$ and Φ(ω) continuous at ω = 0; then $F(z) = \lim_{n \to \infty} F_n(z)$. 
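
The convergence of the CF of the normalized sum to exp(−ω²/2) can be watched numerically for a specific zero-mean, unit-variance distribution. The sketch below assumes, purely for illustration, that each X_i is uniform on (−√3, √3), whose CF is sin(√3 ω)/(√3 ω); the function names, grid, and values of n are likewise assumptions.

```python
import math

def cf_uniform_unit_variance(w):
    """CF of X uniform on (-sqrt(3), sqrt(3)), which has zero mean and unit variance."""
    a = math.sqrt(3.0)
    return 1.0 if w == 0.0 else math.sin(a * w) / (a * w)

def cf_normalized_sum(w, n):
    """CF of Z_n = (X_1 + ... + X_n)/sqrt(n) for i.i.d. X_i: [Phi_X(w/sqrt(n))]^n."""
    return cf_uniform_unit_variance(w / math.sqrt(n)) ** n

def max_gap(n, w_max=4.0, steps=400):
    """Largest |Phi_Zn(w) - exp(-w^2/2)| on a uniform grid over [-w_max, w_max]."""
    grid = [-w_max + 2.0 * w_max * i / steps for i in range(steps + 1)]
    return max(abs(cf_normalized_sum(w, n) - math.exp(-0.5 * w * w)) for w in grid)

for n in (10, 100, 1000):
    print(n, max_gap(n))   # the gap should shrink as n grows
```

For this particular distribution the third moment vanishes, so the gap falls off roughly like 1/n rather than the 1/√n suggested by the general expansion.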


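Example 4.7-11 
(application of the Central Limit Theorem [CLT]) Let $X_i$, i = 1, ..., n, be a sequence of 
i.i.d. RVs with $E[X_i] = \mu_X$ and $\text{Var}[X_i] = \sigma_X^2$. Let $Y \triangleq \sum_{i=1}^{n} X_i$ where n is large. We wish 
to compute P[a ≤ Y ≤ b] using the CLT. With $Z \triangleq (Y - E[Y])/\sigma_Y$, and $\sigma_Y > 0$, 

$$P[a \le Y \le b] = P[a' \le Z \le b'],$$

where 

$$a' \triangleq \frac{a - E[Y]}{\sigma_Y}, \qquad b' \triangleq \frac{b - E[Y]}{\sigma_Y},$$

and 

$$\sigma_Y = \sqrt{n}\, \sigma_X.$$

Note that Z is a zero-mean, unity variance RV involving the sum of a large number (n 
assumed large) of i.i.d. RVs. Indeed with some minor manipulations we can write Z as 

$$Z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left(\frac{X_i - \mu_X}{\sigma_X}\right).$$

Hence 

$$P[a' \le Z \le b'] \approx \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} e^{-\frac{1}{2} z^2}\, dz.$$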


Although the CLT might be more appropriately called the “Normal convergence theorem,” 
the word central in Central Limit Theorem is useful as a reminder that CDFs converge to 
the normal CDF around the center, that is, around the mean. Although all CDFs converge 
together at ±∞, it is in fact in the tails that the CLT frequently gives the poorest estimates 
of the correct probabilities, if these are small. An illustration of this phenomenon is given 
in Problem 4.59. 
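
The contrast between the center and the tails can be made concrete with a case where the exact CDF of the sum is available in closed form. The sketch below assumes, for illustration only, that the X_i are i.i.d. exponential RVs with unit mean, so that their sum has the Erlang CDF 1 − e^{−y} Σ_{k<n} y^k/k!; the choice n = 30 and the two intervals are arbitrary.

```python
import math

def normal_cdf(z):
    """Standard Normal CDF written with the usual error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def erlang_cdf(y, n):
    """Exact CDF of the sum of n i.i.d. exponential RVs with unit mean."""
    if y <= 0.0:
        return 0.0
    term, total = 1.0, 1.0          # k = 0 term of sum_{k<n} y^k / k!
    for k in range(1, n):
        term *= y / k
        total += term
    return 1.0 - math.exp(-y) * total

def clt_cdf(y, n):
    """CLT approximation: the sum has mean n and standard deviation sqrt(n)."""
    return normal_cdf((y - n) / math.sqrt(n))

n = 30
center_exact = erlang_cdf(n + 3, n) - erlang_cdf(n - 3, n)
center_clt = clt_cdf(n + 3, n) - clt_cdf(n - 3, n)
tail_exact = 1.0 - erlang_cdf(n + 4 * math.sqrt(n), n)
tail_clt = 1.0 - clt_cdf(n + 4 * math.sqrt(n), n)

print("center:", center_exact, center_clt)   # close agreement near the mean
print("tail:  ", tail_exact, tail_clt)       # large relative error four sigma out
```

The absolute error in the tail is tiny, but the ratio of exact to approximate probability is what matters when small probabilities are the quantity of interest.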

In a type of computer-based engineering analysis called Monte Carlo simulation, it 
is often necessary to have access to random numbers. There are several random number 
generators available in software that generate numbers that appear random but in fact are 
not: They are generated using an algorithm that is completely deterministic and therefore 
they can be duplicated by anyone who has a copy of the algorithm. The numbers, called 
pseudo-random numbers, are often adequate for situations where not too many random 
numbers are needed. For situations where a very large number of random numbers are 
needed, for example, modeling atomic processes, it turns out that it is difficult to find an 
adequate random number generator. Most will eventually display number sequences that 
repeat, that is, are periodic, are highly correlated, or show other biases. Note that the 
alternative, that is, using a naturally random process such as the emission of photons from 
x-ray sources or the liberation of photoelectrons from valence bands in photodetectors, also 
suffers from a major problem: We cannot be certain what underlying probability law is truly 
at work. And even if we knew what law was at work, the very act of counting photons or 
photoelectrons might bias the distribution of random numbers. 

†See Feller [4-1, p. 508]. 

In any case, if we assume that for our purposes the uniform random number generators 
(URNG) commonly available with most PC software packages are adequate in that they 
create unbiased realizations of a uniform RV X, the next question is how we can convert 
uniform random numbers, that is, those that are assumed to obey the uniform pdf in (0, 1), 
to Gaussian random numbers. For this purpose we can use the CLT as follows. Let $X_i$ 
represent the ith random number generated by the URNG. Then 

$$Z = X_1 + \cdots + X_n$$

will be approximately Gaussian for a reasonably large n (say >10). Note that the pdf of 
Z is the n-repeated convolution of a unit pulse which starts to look like a Gaussian very 
quickly everywhere except in the tails. The reason there is a problem in the tails is that Z 
is confined to the range 0 ≤ Z ≤ n while if Z were a true Gaussian RV, then −∞ < Z < ∞. 
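
The sum-of-uniforms recipe can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended generator: Python's `random.random()` plays the role of the URNG, n = 12 is chosen only because it makes the variance n/12 equal to one, and the centering and scaling step (subtracting n/2 and dividing by sqrt(n/12)) is added here so that the output can be compared with N(0,1).

```python
import math
import random

def approx_gaussian(rng, n=12):
    """Approximate N(0, 1) sample: sum n uniform(0, 1) numbers, then center and scale.

    The sum has mean n/2 and variance n/12, so the result is confined to
    [-sqrt(3n), +sqrt(3n)]; the far tails of a true Gaussian are therefore missing.
    """
    s = sum(rng.random() for _ in range(n))
    return (s - n / 2.0) / math.sqrt(n / 12.0)

rng = random.Random(1)   # fixed seed so the run is repeatable
samples = [approx_gaussian(rng) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
within_one_sigma = sum(abs(z) <= 1.0 for z in samples) / len(samples)

print(mean, var, within_one_sigma, max(abs(z) for z in samples))
```

With n = 12 no sample can exceed 6 in magnitude, which is the bounded-range effect described above.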


4.8 ADDITIONAL EXAMPLES 


Example 4.8-1 
Let $X_i$, i = 1, ..., n, be n i.i.d. Bernoulli RVs with individual PMF: 

$$P_{X_i}(x) = \begin{cases} p^x (1 - p)^{1 - x}, & x = 0, 1, \\ 0, & \text{else.} \end{cases}$$

Show that $Z = \sum_{i=1}^{n} X_i$ is binomial with PMF $b(k; n, p) \triangleq \binom{n}{k} p^k q^{n-k}$. 

Solution The CF of the Bernoulli RV is computed as 

$$\Phi_{X_i}(\omega) = \sum_{x=0}^{1} e^{j\omega x} p^x (1 - p)^{1 - x} = p e^{j\omega} + q,$$

where q = 1 − p. From Equation 4.7-4, we obtain that 

$$\Phi_Z(\omega) = \prod_{i=1}^{n} \left(p e^{j\omega} + q\right) = \left(p e^{j\omega} + q\right)^n,$$

which, from Example 4.7-5, we recognize as the CF of the binomial RV with PMF as above. 
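
The identification of (pe^{jω} + q)^n with the binomial CF can be checked directly by summing e^{jωk} b(k; n, p) over k. The following is a small sketch; the function names and the sample values n = 12, p = 0.3 are assumptions chosen only for illustration.

```python
import cmath
import math

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k q^(n-k)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def cf_from_pmf(w, n, p):
    """Phi_Z(w) = sum_k exp(j*w*k) b(k; n, p), computed directly from the PMF."""
    return sum(cmath.exp(1j * w * k) * binomial_pmf(k, n, p) for k in range(n + 1))

def cf_product_form(w, n, p):
    """(p e^{jw} + q)^n, the n-fold product of the Bernoulli CF."""
    return (p * cmath.exp(1j * w) + (1.0 - p)) ** n

for w in (0.0, 0.3, 1.7, -2.5):
    print(w, abs(cf_from_pmf(w, 12, 0.3) - cf_product_form(w, 12, 0.3)))
```

The differences printed should be at the level of floating-point rounding, which is just the binomial theorem at work.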
Example 4.8-2 
Let Z be a binomial RV with PMF $b(k; n, p) = \binom{n}{k} p^k q^{n-k}$, where n ≫ 1, and consider 
the event {a ≤ Z ≤ b}, where a and b are numbers. Use the CLT to compute P[a ≤ Z ≤ b]. 

Solution From Example 4.8-1 we know Z can be resolved as a sum of n i.i.d. Bernoulli 
RVs. Thus we write $Z = X_1 + \cdots + X_n$, where E[Z] = np and Var[Z] = npq ≫ pq when n 
is large. The situation is now ripe for applying Theorem 4.7-1, the Central Limit Theorem. 
The event {a ≤ Z ≤ b} is identical to the event 

$$\left\{\frac{a - np}{\sqrt{npq}} \le \frac{Z - np}{\sqrt{npq}} \le \frac{b - np}{\sqrt{npq}}\right\}.$$

With $a' \triangleq \frac{a - np}{\sqrt{npq}}$, $b' \triangleq \frac{b - np}{\sqrt{npq}}$, and $Z' \triangleq \frac{Z - np}{\sqrt{npq}}$, the event can be rewritten as {a′ ≤ Z′ ≤ b′}, where Z′ is 
a zero-mean, unit-variance RV. Then from Example 4.7-11, which uses a formula based on 
the CLT, we get 

$$P[a \le Z \le b] \approx \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} \exp\left[-\tfrac{1}{2} z^2\right] dz.$$

In terms of the standard Normal distribution, $F_{SN}(x)$, defined in Equation 1.11-3, this 
result can be written as 

$$P[a \le Z \le b] \approx F_{SN}\left[\frac{b - np}{\sqrt{npq}}\right] - F_{SN}\left[\frac{a - np}{\sqrt{npq}}\right].$$

The correction factor of 0.5 in the limits in Equation 1.11-5 is insignificant when n ≫ 1. 
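
To get a feel for how well the Normal approximation tracks the binomial, the two can be evaluated side by side. The sketch below is illustrative only: the values n = 1000, p = 0.5 and the interval [470, 530] are assumptions, the exact sum treats a and b as integers, and `f_sn` is written with the usual error function rather than the erf convention used in Theorem 4.7-1.

```python
import math

def f_sn(x):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binomial_interval_exact(a, b, n, p):
    """P[a <= Z <= b] summed from the binomial PMF (a and b integers)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def binomial_interval_clt(a, b, n, p):
    """CLT approximation F_SN((b - np)/sqrt(npq)) - F_SN((a - np)/sqrt(npq))."""
    s = math.sqrt(n * p * (1 - p))
    return f_sn((b - n * p) / s) - f_sn((a - n * p) / s)

n, p = 1000, 0.5   # illustrative values, not taken from the text
exact = binomial_interval_exact(470, 530, n, p)
approx = binomial_interval_clt(470, 530, n, p)
print(exact, approx, exact - approx)
```

The small remaining gap is mostly the omitted 0.5 correction in the limits, which matters less and less as the standard deviation sqrt(npq) grows.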


Example 4.8-3 

Let Z be a binomial RV with mean np and standard deviation $\sqrt{npq}$. Use the Normal 
approximation furnished by the CLT to compute the probability of the following events: 
$\{np - \sqrt{npq} \le Z \le np + \sqrt{npq}\}$, $\{np - 2\sqrt{npq} \le Z \le np + 2\sqrt{npq}\}$, 
$\{np - 3\sqrt{npq} \le Z \le np + 3\sqrt{npq}\}$. 

Solution With the change of variable $Z' = \frac{Z - np}{\sqrt{npq}}$ the three events are converted to 
{−1 ≤ Z′ ≤ 1}, {−2 ≤ Z′ ≤ 2}, {−3 ≤ Z′ ≤ 3}. The RV Z′ is zero-mean, unit variance and 
the Normal approximation furnished by the CLT yields: 

$$\begin{aligned}
P[-1 \le Z' \le 1] &\approx F_{SN}(1) - F_{SN}(-1) \approx 0.683, \\
P[-2 \le Z' \le 2] &\approx F_{SN}(2) - F_{SN}(-2) \approx 0.954, \\
P[-3 \le Z' \le 3] &\approx F_{SN}(3) - F_{SN}(-3) \approx 0.997.
\end{aligned}$$

Note that the last-listed event is (almost) certain to occur. In a thousand repetitions it 
will on the average fail to occur only three times. 
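
The three standard Normal probabilities are easy to reproduce. A minimal sketch, assuming only that $F_{SN}$ is the standard Normal CDF and writing it with Python's `math.erf`; the helper name `two_sided` is hypothetical.

```python
import math

def f_sn(x):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided(k):
    """P[-k <= Z' <= k] for a standard Normal Z'."""
    return f_sn(k) - f_sn(-k)

for k in (1, 2, 3):
    print(k, round(two_sided(k), 4))
```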
Example 4.8-4 

Let $X_i$, i = 1, ..., 100, be i.i.d. Poisson RVs with PMF $P[k] = \frac{1}{k!} e^{-2}\, 2^k$, k = 0, 1, 2, .... 
Here k is the number of events in a given interval of time. Let $Z = \sum_{i=1}^{100} X_i$. We note that 
E[Z] = 200 and Var[Z] = 200. This situation might reflect the summed data packets collected 
at a receiver from identical multiple channels. Use the CLT to compute the probability of 
the event {190 ≤ Z ≤ 210}. 

Solution Since Z is the sum of a large number of i.i.d. RVs and the variance of any of these 
is much smaller than the variance of the sum, the CLT permits us to use the Normal 
approximation to compute the probability of this event. Define the RV $Z' = \frac{Z - 200}{\sqrt{200}}$, which is 
zero-mean and unity variance. Then, in terms of Z′, the event becomes {−0.707 ≤ Z′ ≤ 0.707}. 

The Normal approximation yields $F_{SN}(0.707) - F_{SN}(-0.707) \approx 0.52$. 
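
Because a sum of independent Poisson RVs is again Poisson, the exact probability of this event can be computed and set beside the CLT figure. The sketch below assumes Z is Poisson with mean 200 and evaluates the PMF in the log domain to avoid overflow; the function names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF evaluated in the log domain to avoid overflow for large lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def f_sn(x):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

lam = 200.0
exact = sum(poisson_pmf(k, lam) for k in range(190, 211))       # P[190 <= Z <= 210]
clt = f_sn(10 / math.sqrt(lam)) - f_sn(-10 / math.sqrt(lam))    # uncorrected CLT value

print(exact, clt)
```

The exact value comes out somewhat larger than the uncorrected approximation because the event includes both integer endpoints; widening the Normal interval by 0.5 on each side narrows the gap.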


SUMMARY 


In this chapter we discussed the various averages of one or more random variables (RVs) and 
the implication of those averages. We began by defining the average or expected value of an 
RV X and then showed that the expected value of Y = g(X) could be computed directly 
from the pdf or PMF of X. We briefly discussed the important notion of conditional expec- 
tation and showed how the expected value of an RV could be advantageously computed by 
averaging over its conditional expectation. We then argued that a single summary number 
such as the average value, $\mu_X$, of X was insufficient for describing the behavior of X. This 
led to the introduction of moments, that is, the average of powers of X. We illustrated how 
moments can be used to estimate pdf’s by the maximum entropy principle and introduced 
the concept of joint moments. We showed how the covariance of two RVs could be inter- 
preted as a measure of how well we can predict one RV from observing another using a 
linear predictor model. By giving a counterexample, we demonstrated that uncorrelated- 
ness does not imply independence of two RVs, the latter being a stronger condition. The 
joint Gaussian pdf for two RVs was discussed, and it was shown that in the Gaussian case, 
independence and uncorrelatedness are equivalent. We then introduced the reader to some 
important bounds and inequalities known as the Chebyshev and Schwarz inequalities and 
the Chernoff bound and illustrated how these are used in problems in probability. 

The second half of the chapter dealt mostly with moment generating functions (MGFs) 
and characteristic functions (CFs) and the Central Limit Theorem (CLT). We showed how 
the MGF and CF are essentially the Laplace and Fourier transforms, respectively, of the 
pdf of an RV and how we could compute all the moments, provided that these exist, from 
either of these functions. Several properties of these important functions were explored. We 
illustrated how the CF could be used to solve problems involving the computation of the 
pdf’s of the sums of RVs. 

We then discussed the CLT, one of the most important results in probability theory, and 
the basis for the ubiquitous Normal behavior of many random phenomena. The CLT states 
that under relatively loose mathematical constraints, the cumulative distribution function 
(CDF) of the sum of independent RVs tends toward the Normal CDF. 

We ended the chapter with additional examples of the use and application of the CLT. 




Chapter 4 Expectation and Moments 


PROBLEMS 

(*Starred problems are more advanced and may require more work and/or additional 
reading.) 

4.1 Compute the average and standard deviation of the following set: 3.02, 5.61, −2.37, 
4.94, −6.25, −1.05, −3.25, 5.81, 2.27, 0.54, 6.11, −2.56. 

4.2 Compute E[X] when X is a Bernoulli RV, that is, 

$$X = \begin{cases} 1, & P_X(1) = p > 0, \\ 0, & P_X(0) = 1 - p > 0. \end{cases}$$

4.3 Let X = a (a constant). Prove that E[X] = a. 

4.4 Compute E[X] when X is a binomial RV, that is, 

$$P_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, \ldots, n, \quad 0 < p < 1.$$

4.5 Let X be a uniform RV, that is, 

$$f_X(x) = \begin{cases} (b - a)^{-1}, & 0 < a < x < b, \\ 0, & \text{otherwise.} \end{cases}$$

Compute E[X]. 

4.6 Let the pdf of X be fx(z) = va ~ os 
(i) Compute $F_X(x)$; 
(ii) Compute E[X]; 
(iii) Compute $\sigma_X^2$. 

4.7 Find E[X] if 

$$P_X(x) = \frac{\binom{m}{x} \binom{n}{k - x}}{\binom{m + n}{k}}, \quad x = 0, 1, \ldots, k,$$

and 0, else. This PMF is called the hypergeometric distribution and m, n, k are positive integers. 

4.8 In Problem 4.5, let $Y \triangleq X^2$. Compute the pdf of Y and E[Y] by Equation 4.1-8. 
Then compute E[Y] by Equation 4.1-9. 

4.9 Let $Y \triangleq X^2 + 1$. Compute E[Y] and $\sigma_Y^2$ if 

$$f_X(x) = \begin{cases} 2x, & 0 < x < 1, \\ 0, & \text{else.} \end{cases}$$

4.10 Let X be a Poisson RV with parameter a. Compute E[Y] when $Y \triangleq X^2 + 5$. 

4.11 Show that the mean of the Gaussian RV $X: N(\mu, \sigma^2)$ is μ. Start from the defining 
integral for the mean. 
4.12 In your physics courses, you have studied the concept of momentum p = mv in the 
deterministic, that is, nonrandom, sense. In reality, measurements of mass m and 
velocity v are never precise, thereby giving rise to an unavoidable uncertainty in 
these quantities. In this problem, we treat these quantities as RVs. So, consider 
an RV mass M with given pdf $f_M(m)$ and an RV velocity V with given pdf $f_V(v)$. 
We are also given the averages $\mu_M = E[M]$ and $\mu_V = E[V]$ (that would presumably 
correspond to our measurements in the physics course). Assume that M and V are 
independent and nonnegative RVs. 
(a) Express the pdf of the momentum P = MV in terms of the known pdf's 
$f_M(m)$ and $f_V(v)$. 
(b) Determine the expected value of the momentum $\mu_P = E[P]$ as a function of 
$\mu_M$ and $\mu_V$. 

4.13 Prove that if E[X] exists and X is a continuous RV, then $|E[X]| \le E[|X|]$. Repeat 
for X discrete. 

4.14 Show that if $E[g_i(X)]$ exists for i = 1, ..., N, then 

$$E\left[\sum_{i=1}^{N} g_i(X)\right] = \sum_{i=1}^{N} E[g_i(X)].$$

4.15 A random sample of 20 households shows the following numbers of children per 
household: 3, 2, 0, 1, 0, 0, 3, 2, 5, 0, 1, 1, 2, 2, 1, 0, 0, 0, 6, 3. (a) For this set what 
is the average number of children per household? (b) What is the average number of 
children in households given that there is at least one child? 

4.16 Let $B \triangleq \{a < X \le b\}$. Derive a general expression for E[X|B] if X is a continuous 
RV. Let X: N(0,1) with $B = \{-1 < X \le 2\}$. Compute E[X|B]. 

4.17 (Papoulis [4-3]). Let Y = h(X). We wish to compute approximations to E[h(X)] 
and E[h²(X)]. Assume that h(x) admits a power series expansion, that is, all 
derivatives exist. Assume further that all derivatives above the second are small 
enough to be omitted. Given that E[X] = μ and Var(X) = σ², show that 
(a) $E[h(X)] \approx h(\mu) + h''(\mu)\sigma^2/2$; 
(b) $E[h^2(X)] \approx h^2(\mu) + \left([h'(\mu)]^2 + h(\mu)h''(\mu)\right)\sigma^2$. 

4.18 Let X and Y be two RVs. The pdf $f_X(x)$ of X is given as 

$$f_X(x) = \begin{cases} 3x^2, & 0 < x < 1, \\ 0, & \text{elsewhere.} \end{cases}$$

The conditional pdf of Y given that X = x, denoted by $f_{Y|X}(y|x)$, is given as 

$$f_{Y|X}(y|x) = \begin{cases} \dfrac{2y}{x^2}, & 0 < y < x < 1, \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Find the joint pdf $f_{X,Y}(x, y)$ of X and Y. 
(b) Find the conditional mean of Y given X = x, that is, E[Y|x]. 
(c) Find the marginal pdf $f_Y(y)$ and its mean value E[Y]. 
4.19 A particular model of an HDTV is manufactured in three different plants, say, A, B, 
and C, of the same company. Because the workers at A, B, and C are not equally 
experienced, the quality of the units differs from plant to plant. The pdf's of the 
time-to-failure X, in years, are 

$$f_X(x) = \tfrac{1}{5} \exp(-x/5)\, u(x) \quad \text{for A,}$$

$$f_X(x) = \tfrac{1}{6.5} \exp(-x/6.5)\, u(x) \quad \text{for B,}$$

$$f_X(x) = \tfrac{1}{10} \exp(-x/10)\, u(x) \quad \text{for C,}$$

where u(x) is the unit step. Plant A produces three times as many units as B, 
which produces twice as many as C. The TVs are all sent to a central warehouse, 
intermingled, and shipped to retail stores all around the country. What is the expected 
lifetime of a unit purchased at random? 

4.20 A source transmits a signal Θ with pdf 

$$f_\Theta(\theta) = \begin{cases} (2\pi)^{-1}, & 0 < \theta \le 2\pi, \\ 0, & \text{otherwise.} \end{cases}$$

Because of additive Gaussian noise, the pdf of the received signal Y when Θ = θ is 

$$f_{Y|\Theta}(y|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y - \theta}{\sigma}\right)^2\right].$$

Compute E[Y]. 

4.21 Compute the variance of X if X is (a) Bernoulli; (b) binomial; (c) Poisson; (d) Gaussian; 
(e) Rayleigh. 

4.22 An Internet Service Provider (ISP) has two types of servers that route incoming 
packets for its customers. The servers fail randomly and have been found to have 
time-to-failure distributions that are exponential with parameters $\mu_1$ and $\mu_2$, respectively. 
Call these two RV failure times $T_1$ and $T_2$, and assume they are independent. 
Thirty percent of the servers are type 1 and 70 percent are type 2. If a server is 
picked at random, denote its time-to-failure by the RV T. 
(a) What is E[T]? 
(b) What is E[T²]? 
(c) What is the standard deviation $\sigma_T$? 

4.23 Let X and Y be independent RVs, each N(0,1). Find the mean and variance of 
$Z \triangleq \sqrt{X^2 + Y^2}$. 

4.24 Let $X_1, X_2, X_3$ be three i.i.d. standard Normal RVs. We order them as $Y_1 < Y_2 < Y_3$. 
(a) Compute $f_{Y_1 Y_2 Y_3}(y_1, y_2, y_3)$; 
(b) Compute $E[Y_i]$, i = 1, 2, 3. 
4.25 Let $f_{XY}(x, y) = 2$ for 0 < x < y < 1 and zero else. Compute E[Y] and $\sigma_Y^2$. 

4.26 Let $f_{XY}(x, y)$ be given by 

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}} \exp\left(-\frac{x^2 + y^2 - 2\rho x y}{2\sigma^2(1 - \rho^2)}\right),$$

where |ρ| < 1. Show that E[Y] = 0 but E[Y|X = x] = ρx. What does this result say 
about predicting the value of Y upon observing the value of X? 

4.27 Let X and Y be two Gaussian RVs with mean 0 and variance σ². Let 

$$Z \triangleq \tfrac{1}{2}(X + Y).$$

(a) If X and Y are independent, what are the mean and variance of Z? 
(b) Suppose X and Y are no longer independent. Let ρ be the correlation coefficient 
of X and Y. Now, what would be the mean and variance of Z? (Your 
answer may be in terms of ρ.) 
(c) Consider what happens when ρ = −1, ρ = 0, and ρ = +1. Is it always true 
that 

4.28 Show that in the joint Gaussian pdf with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y \triangleq \sigma$, the joint 
pdf asymptotically, as ρ → 1, becomes 

$$f_{XY}(x, y) \to \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right] \delta(y - x).$$

4.29 Consider a probability space $\mathcal{P} = (\Omega, \mathcal{F}, P)$. Let $\Omega = \{\zeta_1, \ldots, \zeta_5\} = \{-1, -\tfrac{1}{2}, 0, \tfrac{1}{2}, 1\}$ 
with $P[\{\zeta_i\}] = \tfrac{1}{5}$, i = 1, ..., 5. Define two RVs on $\mathcal{P}$ as follows: 

$$X(\zeta) \triangleq \zeta \quad \text{and} \quad Y(\zeta) \triangleq \zeta^2.$$

(a) Show that X and Y are dependent RVs. 
(b) Show that X and Y are uncorrelated. 

4.30 Given the conditional Gaussian density 

1 
fy\x (yl) = Van 92 

for two RVs X and Y, what is the conditional mean E[Y|X]? Here α is a known 
constant. 

4.31 We wish to estimate the pdf of X with a function p(x) that maximizes the entropy 

$$H[X] \triangleq -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx.$$

It is known from measurements that E[X] = μ and Var[X] = σ². Find the maximum 
entropy estimate of the pdf of X. 
4.32 Let X: N(0, σ²). Show that 

$$m_n \triangleq E[X^n] = 1 \cdot 3 \cdots (n - 1)\, \sigma^n, \quad n \text{ even} \tag{4.8-1}$$

$$m_n = 0, \quad n \text{ odd.} \tag{4.8-2}$$

4.33 With $\mu_X = E[X]$ and $\mu_Y = E[Y]$, show that if $c_{11} = \sqrt{c_{20} c_{02}}$, then 

$$E\left[\left(\frac{c_{11}}{c_{20}}(X - \mu_X) - (Y - \mu_Y)\right)^2\right] = 0.$$

Use this result to show that when |ρ| = 1, Y is a linear function of X, that is, 
Y = αX + β. Relate α, β to the moments of X and Y. 

4.34 Show that in the optimum linear predictor in Example 4.3-4 the smallest mean-square 
error is 

$$\varepsilon_{\min}^2 = \sigma_Y^2 (1 - \rho^2).$$

Explain why $\varepsilon_{\min}^2 = 0$ when |ρ| = 1. 

4.35 We are given an RV X with pdf $f_X(x) = 1 - (1/2)x$, for 0 ≤ x ≤ 2 and zero else. 
Compute $m_r$, the rth moment of X for r a positive integer. 

4.36 Let $E[X_i] = \mu$, $\text{Var}[X_i] = \sigma^2$. We wish to estimate μ with the sample mean 

$$\hat{\mu}_N \triangleq \frac{1}{N} \sum_{i=1}^{N} X_i.$$

Compute the mean and variance of $\hat{\mu}_N$, assuming the $X_i$ for i = 1, ..., N are independent. 

4.37 In the previous problem, how large should N be so that 

$$P[|\hat{\mu}_N - \mu| > 0.1\sigma] \le 0.01.$$

4.38 Let X be a uniform RV in $(-\tfrac{1}{2}, \tfrac{1}{2})$. Compute (a) its moment-generating function; 
and (b) its mean by Equation 4.5-5. [Hint: $\sinh z \triangleq (e^z - e^{-z})/2$. Use limits when 
computing the mean.] 

4.39 Let X be a Poisson RV. Compute (a) its MGF; and (b) its mean by Equation 4.5-5. 

4.40 The negative binomial distribution with parameters N, Q, P, where Q − P = 1, 
P > 0, and N ≥ 1, is defined by PMF 

$$P_X(k) \triangleq \binom{N + k - 1}{N - 1} \left(\frac{P}{Q}\right)^k \left(1 - \frac{P}{Q}\right)^N \quad (k = 0, 1, 2, \ldots).$$

It is sometimes used as an alternative to the Poisson distribution when one cannot 
guarantee that individual events occur independently (the “strict” randomness 
requirement for the Poisson distribution). Show that the moment-generating function is 
$$M_X(t) = (Q - Pe^t)^{-N}.$$

[Hint: Either compute or look up the expansion formula for $(Q - Pe^t)^{-N}$, for example, 
see Discrete Distributions by N. L. Johnson and S. Kotz, John Wiley and Sons, 1969.] 

4.41 Let X have pdf 

$$f(x; \alpha, \beta) = \begin{cases} \dfrac{x^{\alpha - 1} \exp(-x/\beta)}{\Gamma(\alpha)\, \beta^{\alpha}}, & 0 < x < \infty,\ \beta > 0,\ \alpha > 0, \\ 0, & \text{else.} \end{cases}$$

Find the moment-generating function of X. This is the gamma distribution. 

4.42 Find the mean and variance of X if X has a gamma distribution. 

4.43 Compute the Chernoff bound on P[X ≥ a], where X is an RV that satisfies the 
exponential law $f_X(x) = \lambda e^{-\lambda x} u(x)$. 

4.44 Let N = 1 in Problem 4.40. (a) Compute the Chernoff bound on P[X ≥ k]; (b) generalize 
the result for arbitrary N. 

4.45 Let X have a Cauchy pdf 

$$f_X(x) = \frac{\alpha}{\pi(x^2 + \alpha^2)}.$$

Compute the CF $\Phi_X(\omega)$ of X. 

4.46 Let X have the Cauchy density: $f_X(x) = \left(\pi(1 + (x - a)^2)\right)^{-1}$, −∞ < x < ∞. Find 
E[X]. What problem do you run into when trying to compute $\sigma_X^2$? 

4.47 Find the CF of the exponential RV X with mean μ > 0, that is, 

$$f_X(x) = \frac{1}{\mu} e^{-x/\mu} u(x),$$

where u(x) denotes the unit-step function. 

4.48 Let 

$$Y = \frac{1}{N} \sum_{i=1}^{N} X_i,$$

where the $X_i$ are i.i.d. Cauchy RVs with 

$$f_{X_i}(x) = \frac{1}{\pi[1 + (x - a)^2]}, \quad i = 1, \ldots, N.$$

Show that the pdf of Y is 

$$f_Y(y) = \frac{1}{\pi[1 + (y - a)^2]},$$

that is, is identical to the pdf of the $X_i$'s and independent of N. (Hint: Use the CF 
approach.) 

4.49 Let X be uniform over (−a, a). Let Y be independent of X and uniform over 
([n − 2]a, na), n = 1, 2, .... Compute the expected value of Z = X + Y for each 
n. From this result sketch the pdf of Z. What is the only effect of n? 

4.50 Consider the recursion known as a first-order moving average given by 

$$X_n = Z_n - aZ_{n-1}, \quad |a| < 1,$$
where $X_n$, $Z_n$, $Z_{n-1}$ are all RVs for n = ..., −1, 0, 1, .... Assume $E[Z_n] = 0$ all n; 
$E[Z_n Z_j] = 0$ all n ≠ j; and $E[Z_n^2] = \sigma^2$ all n. Compute $R_n(k) \triangleq E[X_n X_{n-k}]$ for 
k = 0, ±1, ±2, .... 

4.51 Consider the recursion known as a first-order autoregression 

$$X_n = bX_{n-1} + Z_n, \quad |b| < 1.$$

The following is assumed true: $E[Z_n] = 0$, $E[Z_n^2] = \sigma^2$ all n; $E[Z_n Z_j] = 0$ all n ≠ j. 
Also $E[Z_n X_{n-j}] = 0$ for j = 1, 2, .... Compute $R_n(k) = E[X_n X_{n-k}]$ for k = ±1, 
±2, .... Assume $E[X_n^2] \triangleq K$ independent of n. 

4.52 Let $Z \triangleq aX + bY$, $W \triangleq cX + dY$. Compute the joint CF of Z and W in terms of the 
joint CF of X and Y. 

4.53 Let $f_{XY}(x, y) = 4\exp(-4[x + y])$, x > 0, y > 0. Find the joint MGF and CF 
of (X, Y). 

4.54 Let X and Y be two independent Poisson RVs with 


Compute the PMF of Z = X + Y using MGFs or CFs. 

4.55 Your company manufactures toaster ovens. Let the probability that a toaster oven has 
a dent or scratch be p = 0.05. Assume different ovens get dented or scratched independently. 
In one week the company makes 2000 of these ovens. What is the approximate 
probability that in this week more than 109 ovens are dented or scratched? 

4.56 Message length L (in bytes) on a network can be modeled as an i.i.d. exponential RV 
with CDF 

$$F_L(l) \triangleq P[L \le l] = \begin{cases} 1 - e^{-0.002\, l}, & l \ge 0, \\ 0, & l < 0. \end{cases}$$

(a) What is the expected length (in bytes) of the file necessary to store 400 
messages? 
(b) What is the probability that the average length of 400 randomly-chosen 
messages exceeds 520 bytes? 

4.57 We have 100 i.i.d. RVs with means μ and variances σ². We form their sample mean 
$\hat{\mu}_{100}$. Make use of Chebyshev's inequality to upper-bound the probability 
$P[|\hat{\mu}_{100} - \mu| \ge \sigma/5]$. We look for a numerical answer here. 

4.58 Your company manufactures LCD panels. Let the probability that a panel has a 
dead pixel be p = 0.03. Assume different panels get such defects independently. In 
one week the company makes 2000 of these LCD panels. Using the CLT, what is the 
approximate probability that in this week more than 80 panels have dead pixels? 

4.59 Let $X_i$ for i = 1, ..., n be a sequence of i.i.d. Bernoulli RVs with $P_X(1) = p$ and 
$P_X(0) = q = 1 - p$. Let the event of a {1} be a success and the event of a {0} be a 
failure. 
(a) Show that 

$$Z_n \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i,$$

where $W_i \triangleq (X_i - p)/\sqrt{pq}$, is a zero-mean, unity variance RV with a Normal 
CDF when n ≫ 1. 

(b) For n = 2000 and k = 110, 130, 150 compute P[k successes in n tries] 
using (i) the exact binomial expression; (ii) the Poisson approximation to the 
binomial; and (iii) the CLT approximations. Do this by writing three MATLAB 
miniprograms. Verify that as the correct probabilities decrease, the error in 
the CLT approximation increases. 

4.60 In Chapter 1, the following problem was solved using an approximation to the binomial 
probability law. 

Assume that code errors in a computer program occur as follows: A line of code contains 
errors with probability p = 0.001 and is error free with probability q = 0.999. Also errors 
in different lines occur independently. In a 1000-line program, what is the approximate 
probability of finding 2 or more erroneous lines? 

Can the Central Limit Theorem be used here to give an approximate answer? Why 
or why not? Explain your answer. 

4.61 Assume that we have a uniform random number generator (URNG) that is well 
modeled by a sequence of i.i.d. uniform RVs $X_i$, i = 1, ..., n, where $X_i$ is the ith 
output of the URNG. Assume that 

$$f_{X_i}(x_i) = \frac{1}{a}\, \text{rect}\left(\frac{x_i - a/2}{a}\right).$$

(a) Show that with $Z_n = X_1 + \cdots + X_n$, $E[Z_n] = na/2$. (b) Show that $\text{Var}(Z_n) = na^2/12$. 
(c) Write a MATLAB program that computes and plots $f_{Z_n}(z)$ for n = 2, 3, 10, 20. 
(d) Write a MATLAB program that plots Gaussian pdf's $N\left(\frac{na}{2}, \frac{na^2}{12}\right)$ 
for n = 2, 3, 10, 20 and compare $f_{Z_n}(z)$ with $N\left(\frac{na}{2}, \frac{na^2}{12}\right)$ for each n. (e) For each 
n compute $P[\mu_n - k\sigma_n \le Z_n \le \mu_n + k\sigma_n]$, where $\mu_n = na/2$, $\sigma_n^2 = na^2/12$, for 
a few values of k, for example, k = 0.1, 0.5, 1, 2, 3. Do this using both $f_{Z_n}(z)$ and 
$N\left(\frac{na}{2}, \frac{na^2}{12}\right)$. Choose any reasonable value of a, for example, a = 1. 

4.62 Let $f_X(x)$ be the pdf of a real, continuous RV X. Show that if $f_X(x) = f_X(-x)$, then 
E[X] = 0. 

4.63 Compute the variance of the Chi-square RV $W_n = \sum_{i=1}^{n} X_i^2$, where $X_i$ is distributed 
as N(0,1) and the $X_i$, for i = 1, 2, ..., n, are mutually independent. (Hint: Use the 
result in Problem 4.32.) 

4.64 Let X be a Normal RV with $X: N(\mu, \sigma^2)$. Show that $E[(X - \mu)^{2k+1}] = 0$, while 
$E[(X - \mu)^{2k}] = [(2k)!/2^k k!]\, \sigma^{2k}$. 

4.65 (a) Write a MATLAB program (.m file) that will compute the pdf for a Chi-square 
RV $Z_n$ and display it as a graph for n = 30, 40, 50. (b) Add to your program the 
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4.66 


4.67 
4.68 


4.69 


4.70 


4.71 


4.72 


4.73 


4.74 


capability to compute P[u—o < Z, < +o]. Compare your result with a Gaussian 
approximation P[j—o < X < +o], where X:N(n,2n). 
Let X;,i =1,...,4, be four zero-mean Gaussian RVs. Use the joint CF to show that 


E[X1XoX3Xa] = E[X1 Xo] E[X3X4 


+ E[X1X3] E[X2 Xa] 
4+E[X2X3]B[X1 X4). 


Compute the MGF and CF for the Chi-square RV with n degrees of freedom. 
Let E[X;j] = , Var[X;] = 07. We wish to estimate jz with the sample mean 


Tl N 
s- x. 
4=1 


Compute the mean and second moment of {i assuming the X; for i = 1,...,N are 
independent. 
Is the converse statement of Problem 4.62 true? That is, if ELX] = 0, does that imply 


that fx(x) = fx(—2)? 

Let X : U(a,b), where 0 < a < b. With r 4 b/a, show the mean-to-standard deviation 
ratio, /o, approaches the value 1.73 as r — oo. 

Assuming that the X; are ii.d. and Normal, show that W, = IES — 4 
Si X;)/o]? is Chi-square with n — 1 degrees of freedom. 

(conditional expectation) Let Y = X + N, where the RVs X and N are independent 
Poisson RVs with means 20 and 5, respectively. 


(a) Find the conditional PMF of Y given X. 
(b) Find the conditional mean E[Y|X = a]. 


Derive the inequality ox P{|X| > ox] < E||X|] < ox that holds true if fx(#) = 
fx(—2). 


Consider two RVs X and Y together with given values for wx, py, 7X, oF, and p. 
We make a linear estimate of Y based on X, that is, 


\I> 
2 


ii 


n 


Y=aX +8. 
Define the estimate error as 
E = Y ~Y. 


(a) Then find the covariance of the estimate error and the data X, that is, find 
Cov[e, X] = EleX] — Ele] E[X]. 
Express your answer in terms of a and ( and the above given parameter 


values. 
(b) Set $\alpha$ and $\beta$ to their optimal values. Then evaluate $\mathrm{Cov}[\varepsilon, X]$ again. 

4.75 In Problem 4.74 we looked at estimating the RV $Y$ from the RV $X$ with the linear 
estimate 

$$\hat{Y} = \alpha X + \beta.$$

It turned out that the optimal values $\alpha$ and $\beta$ found in class resulted in $\mathrm{Cov}[\varepsilon, X] = 0$. 
Now it is relatively easy to show that this condition, that is, 

$$\mathrm{Cov}[\varepsilon, X] = 0,$$

known as the orthogonality condition, holds for general linear estimation problems 
where, as above, we want to find the best linear estimate in the sense of minimizing 
the mean-square error. In words we say that the estimate error $\varepsilon$ is orthogonal to 
the data used in the estimate, in this case $X$. 

Here we consider a slight generalization of this problem. We now form a linear 
estimate of $Y$ based on two RVs $X_1$ and $X_2$, that is, 

$$\hat{Y} = \alpha_1 X_1 + \alpha_2 X_2 + \beta.$$

We will determine the values of $\alpha_1$ and $\alpha_2$ from the two orthogonality conditions 

$$\mathrm{Cov}[\varepsilon, X_1] = 0 \quad\text{and}\quad \mathrm{Cov}[\varepsilon, X_2] = 0.$$

To make matters simpler, we assume that all three mean values are zero, which implies 
$\beta = 0$, so that the linear estimate simplifies to 

$$\hat{Y} = \alpha_1 X_1 + \alpha_2 X_2.$$

As before, the error is written as $\varepsilon = Y - \hat{Y}$. Note that due to the means being 
zero, $\mathrm{Cov}[\varepsilon, X_1] = E[\varepsilon X_1]$ and $\mathrm{Cov}[\varepsilon, X_2] = E[\varepsilon X_2]$. Please use the following values: 


o7 Wess 4, ae 4, 
$\rho_1 = 0.5$, $\rho_2 = 0.7$, $\rho_{12} = 0.5$, 


where $\rho_1 = E[X_1Y]/\sigma_1\sigma_Y$, $\rho_2 = E[X_2Y]/\sigma_2\sigma_Y$, and $\rho_{12} = E[X_1X_2]/\sigma_1\sigma_2$ here since 
the mean values are all zero. 


(a) Using these given values, write two linear equations that can be solved for 
$\alpha_1$ and $\alpha_2$ using the orthogonality conditions in the form 

$$E[\varepsilon X_1] = 0 \quad\text{and}\quad E[\varepsilon X_2] = 0.$$

(b) Solve these two linear equations for $\alpha_1$ and $\alpha_2$. 


5 Random Vectors 


5.1 JOINT DISTRIBUTION AND DENSITIES 


In many practical problems involving random phenomena we make observations that are 
essentially of a vector nature. We illustrate with three examples. 


Example 5.1-1 
(seismic discrimination) A seismic waveform $X(t)$ is received at a geophysical recording 
station and is sampled at the instants $t_1, t_2, \ldots, t_n$. We thus obtain a vector 
$\mathbf{X} = (X_1, \ldots, X_n)^T$, where $X_i = X(t_i)$ and $T$ denotes transpose.† For political and 
military reasons, at one 
time it was important to determine whether the waveform was radiated from an earthquake 
or an underground explosion. Assume that an expert computer system has available a lot 
of stored data regarding both earthquakes and underground explosions. The vector $\mathbf{X}$ is 
compared to the stored data. What is the probability that $X(t)$ is correctly identified? 


Example 5.1-2 
(health vector) To evaluate the health of grade-school children, the Health Department of 
a certain region measures the height, weight, blood pressure, red-blood cell count, white- 
blood cell count, pulmonary capacity, heart rate, blood-lead level, and vision acuity of each 
child. The resulting vector X is taken as a summary of the health of each child. What is 
the probability that a child chosen at random is healthy? 


†All vectors will be assumed to be column vectors unless otherwise stated. 


Example 5.1-3 
(disease detection) A computer system equipped with a digital scanner is designed to recog- 
nize black-lung disease from x-rays. It does this by counting the number of radio-opacities 
in six lung zones (that is, three in each lung) and estimating the average size of the opacities 
in each zone. The result is a 12-component vector X from which a decision is made. What 
is the best computer decision? 


The three previous examples are illustrative of many problems encountered in engineering 
and science that involve a number of random variables (RVs) that are grouped for some 
purpose. Such groups of RVs are conveniently studied by vector methods. For this reason 
we treat these grouped RVs as a single object called a random vector. As in earlier chapters, 
capital letters at the lower end of the alphabet will denote RVs; bold capital letters will 
denote random vectors and matrices and lowercase bold letters are deterministic vectors, 
for example, the values that random vectors assume. 

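Consider a sample description space $\Omega$ with point $\zeta$ and a set of $n$ real RVs $X_1, X_2, \ldots, X_n$ 
from $\Omega$ to the real line $R$. For each $\zeta \in \Omega$ we generate the $n$-component vector of numbers 
$\mathbf{X}(\zeta) \triangleq (X_1(\zeta), X_2(\zeta), \ldots, X_n(\zeta)) \in R^n$. Then $\mathbf{X} \triangleq (X_1, X_2, \ldots, X_n)$ is said to be an 
$n$-dimensional real random vector. The definition is readily extended to a complex random 
vector. Let $\mathbf{X}$ be an $n$-dimensional random vector defined on sample space $\Omega$ with CDF 
$F_{\mathbf{X}}(\mathbf{x})$. Then by definition† 

$$F_{\mathbf{X}}(\mathbf{x}) \triangleq P[X_1 \le x_1, \ldots, X_n \le x_n]. \tag{5.1-1}$$

By defining $\{\mathbf{X} \le \mathbf{x}\} \triangleq \{X_1 \le x_1, \ldots, X_n \le x_n\}$, we can rewrite Equation 5.1-1 concisely as 

$$F_{\mathbf{X}}(\mathbf{x}) \triangleq P[\mathbf{X} \le \mathbf{x}]. \tag{5.1-2}$$

We associate the events $\{\mathbf{X} \le \infty\}$ and $\{\mathbf{X} \le -\infty\}$ with the certain event $\Omega$ and impossible 
event $\phi$, respectively. Hence 

$$F_{\mathbf{X}}(\infty) = 1 \tag{5.1-3a}$$
$$F_{\mathbf{X}}(-\infty) = 0. \tag{5.1-3b}$$

If the $n$th mixed partial of $F_{\mathbf{X}}(\mathbf{x})$ exists we can define a probability density function (pdf) as 

$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \frac{\partial^n F_{\mathbf{X}}(\mathbf{x})}{\partial x_1 \cdots \partial x_n}. \tag{5.1-4}$$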
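†We remind the reader that the event $\{X_1 \le x_1, \ldots, X_n \le x_n\}$ is the intersection of the $n$ events 
$\{X_i \le x_i\}$ for $i = 1, \ldots, n$. If any one of these sub-events is the impossible event, e.g., $\{X_i \le -\infty\}$, then 
the whole event becomes the impossible event and we would still write $F_{\mathbf{X}}(-\infty) = 0$. 

The reader will observe that these definitions are completely analogous to the scalar 
definitions given in Chapter 2. We could have defined 

$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \lim_{\substack{\Delta x_1 \to 0 \\ \vdots \\ \Delta x_n \to 0}} \frac{P[x_1 < X_1 \le x_1 + \Delta x_1, \ldots, x_n < X_n \le x_n + \Delta x_n]}{\Delta x_1 \cdots \Delta x_n} \tag{5.1-5}$$

and arrived at Equation 5.1-4. For example, for $n = 2$ 

$$P[x_1 < X_1 \le x_1 + \Delta x_1,\; x_2 < X_2 \le x_2 + \Delta x_2] = F_{\mathbf{X}}(x_1 + \Delta x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1 + \Delta x_1, x_2) + F_{\mathbf{X}}(x_1, x_2).$$

Thus (still for $n = 2$) 

$$f_{\mathbf{X}}(\mathbf{x}) = \lim_{\substack{\Delta x_1 \to 0 \\ \Delta x_2 \to 0}} \frac{1}{\Delta x_1 \Delta x_2}\Big[F_{\mathbf{X}}(x_1 + \Delta x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1 + \Delta x_1, x_2) - F_{\mathbf{X}}(x_1, x_2 + \Delta x_2) + F_{\mathbf{X}}(x_1, x_2)\Big]$$

which is by definition the second mixed partial derivative, and thus 

$$f_{\mathbf{X}}(x_1, x_2) = \frac{\partial^2 F_{\mathbf{X}}(x_1, x_2)}{\partial x_1 \partial x_2}.$$

From Equation 5.1-5 we make the useful observation that 

$$f_{\mathbf{X}}(\mathbf{x})\,\Delta x_1 \cdots \Delta x_n \simeq P[x_1 < X_1 \le x_1 + \Delta x_1, \ldots, x_n < X_n \le x_n + \Delta x_n] \tag{5.1-6}$$

if the increments are small. If we integrate Equation 5.1-4, we obtain the CDF as 

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f_{\mathbf{X}}(\mathbf{x}')\, dx_1' \cdots dx_n',$$

which we can write in compact notation as 

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{\mathbf{x}} f_{\mathbf{X}}(\mathbf{x}')\, d\mathbf{x}'.$$

More generally, for any event $B \subset R^N$ ($R^N$ being Euclidean $N$-space) consisting of the 
countable union and intersection of parallelepipeds 

$$P[B] = \int_{\mathbf{x} \in B} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}. \tag{5.1-7}$$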


(Compare with Equation 2.5-3.) The argument behind the validity of Equation 5.1-7 follows 
very closely the argument furnished in the one-dimensional case (Section 2.5). Daven- 
port [5-1, p. 149] discusses the validity of Equation 5.1-7 for the case n = 2. For n > 2 
one can proceed by induction. 
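
The limit in Equation 5.1-5 can be illustrated numerically for $n = 2$. The following is a minimal sketch, not part of the original development: it assumes, purely for illustration, that the two components are independent $N(0,1)$ RVs, forms the four-corner CDF difference with a small finite increment, and compares the resulting ratio with the known product-form pdf. All function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """CDF of a scalar N(0,1) RV."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def joint_cdf(x1, x2):
    """Joint CDF of two independent N(0,1) components (illustrative choice)."""
    return std_normal_cdf(x1) * std_normal_cdf(x2)

def rect_prob(x1, x2, d1, d2):
    """P[x1 < X1 <= x1+d1, x2 < X2 <= x2+d2] from the four-corner CDF formula."""
    return (joint_cdf(x1 + d1, x2 + d2) - joint_cdf(x1, x2 + d2)
            - joint_cdf(x1 + d1, x2) + joint_cdf(x1, x2))

def pdf_from_cdf(x1, x2, d=1e-3):
    """Finite-increment version of Equation 5.1-5 for n = 2."""
    return rect_prob(x1, x2, d, d) / (d * d)

def exact_pdf(x1, x2):
    """Closed-form joint pdf of the same two independent N(0,1) components."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)

# The two printed numbers should agree to several decimal places.
print(pdf_from_cdf(0.3, -0.7), exact_pdf(0.3, -0.7))
```

Shrinking the increment `d` tightens the agreement until floating-point cancellation in the four-term difference begins to dominate.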
The CDF of $\mathbf{X}$ given the event $B$ is defined by 

$$F_{\mathbf{X}|B}(\mathbf{x}|B) \triangleq P[\mathbf{X} \le \mathbf{x}\,|\,B] = \frac{P[\mathbf{X} \le \mathbf{x}, B]}{P[B]} \qquad (P[B] \ne 0).$$

These and subsequent results closely parallel the one-dimensional case. Consider next the 
$n$ disjoint and exhaustive events $\{B_i, i = 1, \ldots, n\}$ with $P[B_i] > 0$. Then $\bigcup_{i=1}^{n} B_i = \Omega$ and 
$B_i B_j = \phi$ for all $i \ne j$. From the Total Probability Theorem 1.6-1, it then follows that 

$$F_{\mathbf{X}}(\mathbf{x}) = \sum_{i=1}^{n} F_{\mathbf{X}|B_i}(\mathbf{x}|B_i)\, P[B_i]. \tag{5.1-8}$$

The unconditional CDF on the left is sometimes called a mixture distribution function. The 
conditional pdf of $\mathbf{X}$ given the event $B$ is an $n$th mixed partial derivative of $F_{\mathbf{X}|B}(\mathbf{x}|B)$ if 
it exists. Thus, 

$$f_{\mathbf{X}|B}(\mathbf{x}|B) \triangleq \frac{\partial^n F_{\mathbf{X}|B}(\mathbf{x}|B)}{\partial x_1 \cdots \partial x_n}. \tag{5.1-9}$$

It follows from Equations 5.1-8 and 5.1-9 that 

$$f_{\mathbf{X}}(\mathbf{x}) = \sum_{i=1}^{n} f_{\mathbf{X}|B_i}(\mathbf{x}|B_i)\, P[B_i]. \tag{5.1-10}$$

Because $f_{\mathbf{X}}(\mathbf{x})$ is a mixture, that is, a linear combination of conditional pdf's, it is sometimes 
called a mixture pdf. 
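
As an illustration of Equation 5.1-10, the sketch below evaluates a two-component mixture pdf in two dimensions and checks numerically that it integrates to approximately one. The conditional pdf's (unit-variance, uncorrelated Gaussians), the prior probabilities, and the mean vectors are assumptions chosen only for this example; the function names are hypothetical.

```python
import math

def gauss2_pdf(x, mean):
    """2-D Gaussian pdf with identity covariance, used as f_{X|B_i}."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-0.5 * (dx * dx + dy * dy)) / (2.0 * math.pi)

def mixture_pdf(x, priors, means):
    """Equation 5.1-10: weighted sum of conditional pdf's."""
    return sum(p * gauss2_pdf(x, m) for p, m in zip(priors, means))

def grid_integral(f, lo, hi, steps):
    """Midpoint-rule integral of f over the square [lo, hi] x [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            total += f((lo + (i + 0.5) * h, lo + (j + 0.5) * h))
    return total * h * h

PRIORS = [0.3, 0.7]                    # P[B_1], P[B_2] (assumed values)
MEANS = [(-2.0, 0.0), (1.0, 1.0)]      # conditional mean vectors (assumed)

total_mass = grid_integral(lambda x: mixture_pdf(x, PRIORS, MEANS), -8.0, 8.0, 160)
print(total_mass)  # should be very close to 1
```

Because the priors sum to one and each conditional pdf integrates to one, the mixture inherits unit volume, which is what the grid integral is checking.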
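The joint CDF of two random vectors $\mathbf{X} = (X_1, \ldots, X_n)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)^T$ is 

$$F_{\mathbf{XY}}(\mathbf{x}, \mathbf{y}) = P[\mathbf{X} \le \mathbf{x}, \mathbf{Y} \le \mathbf{y}]. \tag{5.1-11}$$

The joint density of $\mathbf{X}$ and $\mathbf{Y}$, if it exists, is given by 

$$f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y}) = \frac{\partial^{(n+m)} F_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})}{\partial x_1 \cdots \partial x_n\, \partial y_1 \cdots \partial y_m}. \tag{5.1-12}$$

The marginal density of $\mathbf{X}$ alone, $f_{\mathbf{X}}(\mathbf{x})$, can be obtained from $f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})$ by integration, 
that is, 

$$f_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})\, dy_1 \cdots dy_m.$$

Similarly, the marginal pdf of a reduced vector $\mathbf{X}' = (X_1, \ldots, X_{n-1})^T$ is obtained from the 
pdf of $\mathbf{X}$ by 

$$f_{\mathbf{X}'}(\mathbf{x}') = \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\, dx_n, \quad \text{where } \mathbf{x}' \triangleq (x_1, \ldots, x_{n-1})^T. \tag{5.1-13}$$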


Obviously, Equation 5.1-13 can be extended to all the other marginal pdf’s as well by merely 
integrating over the appropriate variable. 


†This usage is prevalent in statistical pattern recognition. 


Example 5.1-4 
(particle at random) Let $\mathbf{X} = (X_1, X_2, X_3)^T$ denote the position of a particle inside a sphere 
of radius $a$ centered about the origin. Assume that at the instant of observation, the particle 
is equally likely to be anywhere in the sphere, that is, 

$$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} \dfrac{3}{4\pi a^3}, & \sqrt{x_1^2 + x_2^2 + x_3^2} < a \\ 0, & \text{otherwise.} \end{cases}$$

Compute the probability that the particle lies within a subsphere of radius $2a/3$ contained 
within the larger sphere. 


Solution Let $E$ denote the event that the particle lies within the subsphere (centered at 
the origin for simplicity) and let 

$$\mathcal{R} \triangleq \{x_1, x_2, x_3 : \sqrt{x_1^2 + x_2^2 + x_3^2} < 2a/3\}.$$

Then the evaluation of 

$$P[E] = \iiint_{\mathcal{R}} f_{\mathbf{X}}(x_1, x_2, x_3)\, dx_1\, dx_2\, dx_3$$

is best done using spherical coordinates, that is, 

$$P[E] = \frac{3}{4\pi a^3} \int_{r=0}^{2a/3} \int_{\phi=0}^{\pi} \int_{\theta=0}^{2\pi} r^2 \sin\phi\; dr\, d\phi\, d\theta.$$

Note that in this simple case the answer can be obtained directly by noting the ratio of 
volumes, that is, $(2a/3)^3 \div a^3 = 8/27 \simeq 0.3$. 
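
The volume-ratio answer of Example 5.1-4 can also be checked by simulation. The sketch below is an illustrative Monte Carlo experiment, not part of the original solution: it draws points uniformly inside the sphere by rejection from the enclosing cube (an assumed, simple sampling method) and counts the fraction that fall inside the subsphere. The function names, trial count, and seed are hypothetical choices.

```python
import random

def sample_in_sphere(rng, a):
    """Draw one point uniformly inside a sphere of radius a by rejection
    from the enclosing cube [-a, a]^3."""
    while True:
        x = [rng.uniform(-a, a) for _ in range(3)]
        if sum(c * c for c in x) < a * a:
            return x

def subsphere_probability(a=1.0, frac=2.0 / 3.0, trials=100000, seed=0):
    """Estimate P[particle lies within radius frac*a] for a uniform particle."""
    rng = random.Random(seed)
    r2 = (frac * a) ** 2
    hits = 0
    for _ in range(trials):
        x = sample_in_sphere(rng, a)
        if sum(c * c for c in x) < r2:
            hits += 1
    return hits / trials

estimate = subsphere_probability()
print(estimate, 8.0 / 27.0)  # the estimate should be close to 8/27
```

The estimate does not depend on the radius $a$, in agreement with the ratio-of-volumes argument.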


5.2 MULTIPLE TRANSFORMATION OF RANDOM VARIABLES 


The material in this section is a direct extension of Section 3.4 in Chapter 3. Let $\mathbf{X}$ be an 
$n$-dimensional random vector defined on sample space $\Omega$. Then consider the $n$ real functions 

$$\begin{aligned}
y_1 &= g_1(x_1, x_2, \ldots, x_n) \\
y_2 &= g_2(x_1, x_2, \ldots, x_n) \\
&\;\;\vdots \\
y_n &= g_n(x_1, x_2, \ldots, x_n)
\end{aligned} \tag{5.2-1}$$

where the $g_i$, $i = 1, \ldots, n$, are functionally independent, meaning that there exists no nontrivial 
function $H(y_1, y_2, \ldots, y_n)$ that is identically zero for all values of the $x_i$. For example, the three 
linear functions 

$$\begin{aligned}
y_1 &= x_1 - 2x_2 + x_3 \\
y_2 &= 3x_1 + 2x_2 + 2x_3 \\
y_3 &= 5x_1 - 2x_2 + 4x_3
\end{aligned} \tag{5.2-2}$$

are not functionally independent because $H(y_1, y_2, y_3) = 2y_1 + y_2 - y_3 = 0$ for all values 
of $x_1, x_2, x_3$. We create the vector of $n$ RVs $\mathbf{Y} \triangleq (Y_1, Y_2, \ldots, Y_n)$ according to 

$$\begin{aligned}
Y_1 &= g_1(X_1, X_2, \ldots, X_n) \\
Y_2 &= g_2(X_1, X_2, \ldots, X_n) \\
&\;\;\vdots \\
Y_n &= g_n(X_1, X_2, \ldots, X_n).
\end{aligned} \tag{5.2-3}$$

In this way we have generated $n$ functions of $n$ RVs. In order to save on notation, we let 
$\mathbf{x} \triangleq (x_1, x_2, \ldots, x_n)$, $\mathbf{y} \triangleq (y_1, y_2, \ldots, y_n)$ and ask: Given the joint pdf $f_{\mathbf{X}}(\mathbf{x})$, how do we 
compute the joint pdf of the $Y_i$, $i = 1, \ldots, n$, that is, $f_{\mathbf{Y}}(\mathbf{y})$? Note that if we start out with 
fewer RVs $Y_i$, say $i = 1, \ldots, m$, than the number of $X_i$, say $i = 1, \ldots, n$ with $m < n$, we 
can add more $Y_i$ by introducing auxiliary functions as we did in Example 3.5-4. 

We assume that we can solve the set of Equations 5.2-1 uniquely for the $x_i$, $i = 
1, \ldots, n$, as 

$$\begin{aligned}
x_1 &= \phi_1(y_1, y_2, \ldots, y_n) \\
x_2 &= \phi_2(y_1, y_2, \ldots, y_n) \\
&\;\;\vdots \\
x_n &= \phi_n(y_1, y_2, \ldots, y_n).
\end{aligned} \tag{5.2-4}$$


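Now consider the infinitesimal event $A \triangleq \{\zeta : y_i < Y_i \le y_i + dy_i,\; i = 1, \ldots, n\}$. Here the 
$Y_i$ are restricted to take on values in the infinitesimal rectangular parallelepiped that we 
denote by $\mathcal{P}_y$. Following the procedure in Equations 3.4-5 to 3.4-8, we write 

$$P[A] = \int_{\mathcal{P}_y} f_{\mathbf{Y}}(\mathbf{y})\, d\mathbf{y} = f_{\mathbf{Y}}(\mathbf{y})\, V_y = \int_{\mathcal{P}_x} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x} = f_{\mathbf{X}}(\mathbf{x})\, V_x \tag{5.2-5}$$

where $\mathcal{P}_x$ is an infinitesimal parallelepiped (not necessarily rectangular), $V_x$ is the volume 
of $\mathcal{P}_x$, and $V_y$ is the volume of $\mathcal{P}_y$. From Equation 5.2-5 we obtain 

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})\, \frac{V_x}{V_y}. \tag{5.2-6}$$

The ratio of infinitesimal volumes is shown in Appendix C to be the magnitude of the 
determinant $\tilde{J}$ given by 

$$\tilde{J} = \begin{vmatrix} \dfrac{\partial \phi_1}{\partial y_1} & \cdots & \dfrac{\partial \phi_1}{\partial y_n} \\ \vdots & & \vdots \\ \dfrac{\partial \phi_n}{\partial y_1} & \cdots & \dfrac{\partial \phi_n}{\partial y_n} \end{vmatrix} \tag{5.2-7}$$

$$\phantom{\tilde{J}} = \begin{vmatrix} \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{vmatrix}^{-1} = J^{-1}. \tag{5.2-8}$$

Hence 

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})\,|\tilde{J}| = f_{\mathbf{X}}(\mathbf{x})/|J|. \tag{5.2-9}$$

In general, the infinitesimal rectangular parallelepiped in the $\mathbf{y}$ system maps into $r$ disjoint, 
infinitesimal parallelepipeds in the $\mathbf{x}$ system. Then the event $A$, as defined above, is the 
union of the events $E_i$, $i = 1, \ldots, r$, where $E_i \triangleq \{\mathbf{X} \in \mathcal{P}_x^{(i)}\}$ and $\mathcal{P}_x^{(i)}$ is one of the $r$ 
parallelepipeds in the $\mathbf{x}$ system with volume $V_x^{(i)}$. Since the regions and, therefore, the 
events are disjoint, the elementary probabilities $P[E_i]$ add, and we obtain the main result 
of this section, that is, 

$$f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i=1}^{r} f_{\mathbf{X}}(\mathbf{x}^{(i)})\,|\tilde{J}_i| \tag{5.2-10}$$

$$\phantom{f_{\mathbf{Y}}(\mathbf{y})} = \sum_{i=1}^{r} f_{\mathbf{X}}(\mathbf{x}^{(i)})/|J_i|. \tag{5.2-11}$$

In Equations 5.2-10 and 5.2-11 $|\tilde{J}_i| \triangleq V_x^{(i)}/V_y$ and $|J_i| = |\tilde{J}_i|^{-1}$. 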


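Example 5.2-1 
(vector transformation) We are given three scalar transformations of vector $\mathbf{x}$: 

$$\begin{aligned}
y_1 &= x_1^2 - x_2^2 \\
y_2 &= x_1^2 + x_2^2 \\
y_3 &= x_3.
\end{aligned}$$

There are four solutions (roots) to the system. They are 

$$\begin{aligned}
x_1^{(1)} &= ((y_1 + y_2)/2)^{1/2} & x_1^{(2)} &= ((y_1 + y_2)/2)^{1/2} \\
x_2^{(1)} &= ((y_2 - y_1)/2)^{1/2} & x_2^{(2)} &= -((y_2 - y_1)/2)^{1/2} \\
x_3^{(1)} &= y_3 & x_3^{(2)} &= y_3 \\[4pt]
x_1^{(3)} &= -((y_1 + y_2)/2)^{1/2} & x_1^{(4)} &= -((y_1 + y_2)/2)^{1/2} \\
x_2^{(3)} &= ((y_2 - y_1)/2)^{1/2} & x_2^{(4)} &= -((y_2 - y_1)/2)^{1/2} \\
x_3^{(3)} &= y_3 & x_3^{(4)} &= y_3.
\end{aligned} \tag{5.2-12}$$

For the roots to be real, $y_2 \ge 0$, $y_1 + y_2 \ge 0$, and $y_2 - y_1 \ge 0$. Hence $y_2 \ge |y_1|$. In this 
case the single rectangular parallelepiped in the three-dimensional $\mathbf{y}$ space maps into four 
disjoint, infinitesimal parallelepipeds in three-dimensional $\mathbf{x}$ space. 


Example 5.2-2 
(more vector transformation) For the transformation considered in Example 5.2-1, compute 
$f_{\mathbf{Y}}(\mathbf{y})$ if 

$$f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-3/2} \exp\left[-\tfrac{1}{2}\left(x_1^2 + x_2^2 + x_3^2\right)\right],$$

i.e., $\mathbf{X}$ is a three-dimensional standard Gaussian RV. 


Solution We must compute the Jacobian $|J|$ at each of the four roots. The Jacobian is 
computed as 

$$J = \begin{vmatrix} 2x_1 & -2x_2 & 0 \\ 2x_1 & 2x_2 & 0 \\ 0 & 0 & 1 \end{vmatrix} = 8x_1x_2.$$

For example at the first root we compute 

$$J_1 = 4\left(y_2^2 - y_1^2\right)^{1/2}.$$

A direct calculation shows that $|J_1| = |J_2| = |J_3| = |J_4|$. Finally, labeling the four solutions 
in Equation 5.2-12 as $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$, we obtain 

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{|J_1|} \sum_{i=1}^{4} f_{\mathbf{X}}(\mathbf{x}_i) = \frac{(2\pi)^{-3/2}}{\left(y_2^2 - y_1^2\right)^{1/2}} \exp\left[-\tfrac{1}{2}\left(y_2 + y_3^2\right)\right] \times u(y_2)\,u(y_2 - |y_1|).^{\dagger}$$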
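
A numerical sanity check of the result of Example 5.2-2 is sketched below. It is an illustration under stated assumptions, not part of the original solution: the density is taken to be $f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-3/2}(y_2^2 - y_1^2)^{-1/2}\exp[-\tfrac{1}{2}(y_2 + y_3^2)]$ for $y_2 > |y_1|$, the test box is an arbitrary choice kept away from the boundary $y_2 = |y_1|$, and the probability of the box is computed twice, once by a midpoint-rule integral of the formula and once by simulating $\mathbf{X}$ and applying the transformation of Example 5.2-1. Function names are hypothetical.

```python
import math
import random

def f_Y(y1, y2, y3):
    """Density of Example 5.2-2; zero outside the region y2 > |y1|."""
    if y2 <= abs(y1):
        return 0.0
    return ((2.0 * math.pi) ** -1.5 / math.sqrt(y2 * y2 - y1 * y1)
            * math.exp(-0.5 * (y2 + y3 * y3)))

def box_prob_formula(box, steps=40):
    """Midpoint-rule integral of f_Y over a rectangular box in y-space."""
    (a1, b1), (a2, b2), (a3, b3) = box
    h1, h2, h3 = (b1 - a1) / steps, (b2 - a2) / steps, (b3 - a3) / steps
    total = 0.0
    for i in range(steps):
        y1 = a1 + (i + 0.5) * h1
        for j in range(steps):
            y2 = a2 + (j + 0.5) * h2
            for k in range(steps):
                y3 = a3 + (k + 0.5) * h3
                total += f_Y(y1, y2, y3)
    return total * h1 * h2 * h3

def box_prob_monte_carlo(box, trials=100000, seed=1):
    """Simulate X ~ standard Gaussian, transform to Y, count hits in the box."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x1, x2, x3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        y = (x1 * x1 - x2 * x2, x1 * x1 + x2 * x2, x3)
        if all(lo < v <= hi for v, (lo, hi) in zip(y, box)):
            hits += 1
    return hits / trials

BOX = ((-0.5, 0.5), (1.0, 2.0), (-1.0, 1.0))  # arbitrary test region
p_formula = box_prob_formula(BOX)
p_sim = box_prob_monte_carlo(BOX)
print(p_formula, p_sim)  # the two values should be close
```

Agreement between the two numbers supports both the Jacobian magnitude and the factor of four contributed by the four roots.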


Although a random vector is completely characterized by its distribution or density function, 
the latter is often hard to come by except for some notable exceptions. By far the two most 
important exceptions are (1) when Fx(x) = Fx, (41)... Fx, (an), that is, the n components 
of X are independent, and (2) when X obeys the multidimensional Gaussian law. Case (1) 
is easily handled, since it is a direct extension of the scalar case. Case (2) will be discussed in 
Section 5.6. But what to do when neither case (1) nor (2) applies? Estimating multidimen- 
sional distributions involving dependent variables is often not practical and even if available 
might be too complex to be of any real use. Therefore, when we deal with vector RVs, we 
often settle for a less complete but more computable characterization based on moments. 
For most engineering applications, the most important moments are the expectation vector 
(the first moment) and the covariance matrix (a second moment). These quantities and their 
use are discussed later on in the chapter. Next, we consider random vectors with ordered 
components. 


5.3 ORDERED RANDOM VARIABLES 


In Section 3.4 (Examples 3.4-3 and 3.4-5) we introduced the notion of two ordered RVs. 
Here we generalize to $n$ RVs and obtain some important results regarding these. Ordered 
RVs are quite important because in the absence of any information about the distribution 
of the RVs, the statistics of the ordering transformation can give us significant information 
about such parameters as the median, range, and others that are closely related to the 
parameters of distributions. 


†It would be challenging to show that this pdf integrates to unity. 


Consider $n$ i.i.d. continuous RVs each with pdf $f_X(x)$, where 
$-\infty < x < \infty$. The joint pdf of all $n$ RVs is $f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = f_X(x_1) \cdots f_X(x_n)$ 
and the joint marginal density of, say, $X_1$ and $X_n$ is obtained by integrating out with 
respect to $x_2, \ldots, x_{n-1}$. Now arrange the $n$ RVs in order of increasing size; that is, if 
$X_k = \min(X_1, \ldots, X_n)$ then $Y_1 = X_k$, and $Y_2$ is the next smallest of the $\{X_i, i = 1, \ldots, n, 
i \ne k\}$, and $Y_3$ is the next smallest after that until, finally, $Y_n = \max(X_1, \ldots, X_n)$. We thus 
have performed an ordering transformation, and we can write that the strict inequalities 
$Y_1 < Y_2 < \cdots < Y_{n-1} < Y_n$ occur with probability 1, since the $X_i$ are assumed continuous 
RVs. We wish to find the joint pdf of the $\{Y_i, i = 1, \ldots, n\}$. At first glance we might argue, 
incorrectly, that since the set $S_1 = \{X_i, i = 1, \ldots, n\}$ contains the same elements as the set 
$S_2 = \{Y_i, i = 1, \ldots, n\}$, 

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = f_{X_1 \cdots X_n}(y_1, \ldots, y_n) = f_{X_1}(y_1) \cdots f_{X_n}(y_n)$$

for $\{y_i : -\infty < y_i < \infty, i = 1, \ldots, n\}$. However, this result ignores the fact that the $\{Y_i, i = 
1, \ldots, n\}$ are not independent random variables. For example, if you have observed $X_1$, what 
have you learned about $X_2$ from observing $X_1$? Nothing, it turns out, but if you are given 
$Y_1$, you know right away that $Y_2 > Y_1$ and you also know that the probability that $Y_1 > Y_2$ 
is zero. Hence there is no probability mass in the region $y_1 > y_2$. With this in mind we 
might want to modify the joint pdf of the $\{Y_i\}$ to 

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \begin{cases} f_{X_1}(y_1) \cdots f_{X_n}(y_n), & \text{for } y_1 < y_2 < \cdots < y_n \\ 0, & \text{else.} \end{cases}$$

However, now we have another problem: The volume enclosed by the modified joint pdf is 
not unity. Indeed for $n$ large it could be substantially smaller than unity. To get the correct 
joint pdf for the $\{Y_i\}$, we shall use the results of Section 5.2, which allow us to compute the 
pdf of one set of RVs that are functionally related to another set whose pdf we already know. 
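
To see how far from unity the volume under the naively modified pdf is, note that this volume equals the probability that the unordered i.i.d. observations already happen to satisfy $X_1 < X_2 < \cdots < X_n$. By symmetry among the $n!$ equally likely orderings this probability is $1/n!$. The sketch below estimates it by simulation; the exponential parent distribution, the trial count, and the function name are illustrative assumptions.

```python
import math
import random

def prob_already_ordered(n, trials=60000, seed=7):
    """Estimate P[X1 < X2 < ... < Xn] for n i.i.d. continuous RVs.

    This probability equals the volume under the product pdf restricted
    to the region y1 < y2 < ... < yn.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.expovariate(1.0) for _ in range(n)]  # assumed parent pdf
        if all(xs[i] < xs[i + 1] for i in range(n - 1)):
            hits += 1
    return hits / trials

for n in (2, 3, 4):
    print(n, prob_already_ordered(n), 1.0 / math.factorial(n))
```

The missing factor is therefore $n!$, which is the factor the ordering transformation supplies.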


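We begin by partitioning the $n$-dimensional space $(-\infty < x_1, x_2, \ldots, x_n < \infty)$ into $n!$ 
nonoverlapping, distinct regions described by $\mathcal{R}_i = \{x_{i(1)} < x_{i(2)} < \cdots < x_{i(j)} < \cdots < x_{i(n)}\}$ 
for $1 \le i \le n!$, $1 \le j \le n$, and $x_{i(j)} \in \{x_1, x_2, \ldots, x_n\}$. Note that $x_{i(j)} < x_{i(k)}$ for $j < k$. 
Each region will have a different size-ordering of its elements. For example, consider 3-space 
$(x_1, x_2, x_3)$. Then a distinct, nonoverlapping partition is 

$$\begin{aligned}
\mathcal{R}_1 &= (x_1 < x_2 < x_3) & \mathcal{R}_4 &= (x_2 < x_3 < x_1) \\
\mathcal{R}_2 &= (x_1 < x_3 < x_2) & \mathcal{R}_5 &= (x_3 < x_1 < x_2) \\
\mathcal{R}_3 &= (x_2 < x_1 < x_3) & \mathcal{R}_6 &= (x_3 < x_2 < x_1).
\end{aligned}$$

For each of the $n!$ regions we define $y_1 \triangleq x_{i(1)} < y_2 \triangleq x_{i(2)} < \cdots < y_n \triangleq x_{i(n)}$; $i = 1, \ldots, n!$. 
For example in 3-space $(x_1, x_2, x_3)$ we have 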


for $\mathcal{R}_1$: $y_1 = x_1$; $y_2 = x_2$; $y_3 = x_3$ 
for $\mathcal{R}_2$: $y_1 = x_1$; $y_2 = x_3$; $y_3 = x_2$ 
for $\mathcal{R}_3$: $y_1 = x_2$; $y_2 = x_1$; $y_3 = x_3$ 
for $\mathcal{R}_4$: $y_1 = x_2$; $y_2 = x_3$; $y_3 = x_1$ 
for $\mathcal{R}_5$: $y_1 = x_3$; $y_2 = x_1$; $y_3 = x_2$ 
for $\mathcal{R}_6$: $y_1 = x_3$; $y_2 = x_2$; $y_3 = x_1$. 

Thus in 3-space $(x_1, x_2, x_3)$, there are six sets of transformation equations and $6 = 3!$ distinct 
solutions for $y_1 < y_2 < y_3$; there are no solutions in $\mathbf{y}$-space otherwise: 

in $\mathcal{R}_1$: $y_1 = g_1(x_1, x_2, x_3) = x_1$; $x_1^{(1)} = \phi_1(y_1, y_2, y_3) = y_1$ 
$y_2 = h_1(x_1, x_2, x_3) = x_2$; $x_2^{(1)} = \psi_1(y_1, y_2, y_3) = y_2$ 
$y_3 = q_1(x_1, x_2, x_3) = x_3$; $x_3^{(1)} = \theta_1(y_1, y_2, y_3) = y_3$ 

in $\mathcal{R}_2$: $y_1 = g_2(x_1, x_2, x_3) = x_1$; $x_1^{(2)} = \phi_2(y_1, y_2, y_3) = y_1$ 
$y_2 = h_2(x_1, x_2, x_3) = x_3$; $x_3^{(2)} = \psi_2(y_1, y_2, y_3) = y_2$ 
$y_3 = q_2(x_1, x_2, x_3) = x_2$; $x_2^{(2)} = \theta_2(y_1, y_2, y_3) = y_3$ 

in $\mathcal{R}_3$: $y_1 = g_3(x_1, x_2, x_3) = x_2$; $x_2^{(3)} = \phi_3(y_1, y_2, y_3) = y_1$ 
$y_2 = h_3(x_1, x_2, x_3) = x_1$; $x_1^{(3)} = \psi_3(y_1, y_2, y_3) = y_2$ 
$y_3 = q_3(x_1, x_2, x_3) = x_3$; $x_3^{(3)} = \theta_3(y_1, y_2, y_3) = y_3$ 

in $\mathcal{R}_4$: $y_1 = g_4(x_1, x_2, x_3) = x_2$; $x_2^{(4)} = \phi_4(y_1, y_2, y_3) = y_1$ 
$y_2 = h_4(x_1, x_2, x_3) = x_3$; $x_3^{(4)} = \psi_4(y_1, y_2, y_3) = y_2$ 
$y_3 = q_4(x_1, x_2, x_3) = x_1$; $x_1^{(4)} = \theta_4(y_1, y_2, y_3) = y_3$ 

in $\mathcal{R}_5$: $y_1 = g_5(x_1, x_2, x_3) = x_3$; $x_3^{(5)} = \phi_5(y_1, y_2, y_3) = y_1$ 
$y_2 = h_5(x_1, x_2, x_3) = x_1$; $x_1^{(5)} = \psi_5(y_1, y_2, y_3) = y_2$ 
$y_3 = q_5(x_1, x_2, x_3) = x_2$; $x_2^{(5)} = \theta_5(y_1, y_2, y_3) = y_3$ 

in $\mathcal{R}_6$: $y_1 = g_6(x_1, x_2, x_3) = x_3$; $x_3^{(6)} = \phi_6(y_1, y_2, y_3) = y_1$ 
$y_2 = h_6(x_1, x_2, x_3) = x_2$; $x_2^{(6)} = \psi_6(y_1, y_2, y_3) = y_2$ 
$y_3 = q_6(x_1, x_2, x_3) = x_1$; $x_1^{(6)} = \theta_6(y_1, y_2, y_3) = y_3$. 


The magnitude of the Jacobian of each of these transformations is unity so that Equation 
5.2-11, specialized here (in slightly different notation) for three ordered i.i.d. RVs, yields 

$$f_{Y_1Y_2Y_3}(y_1, y_2, y_3) = \sum_{m=1}^{3!} \frac{f_{X_1X_2X_3}(x_1^{(m)}, x_2^{(m)}, x_3^{(m)})}{|J_m|} = \sum_{m=1}^{3!} f_X(x_1^{(m)})\, f_X(x_2^{(m)})\, f_X(x_3^{(m)}).$$

Finally, expanding the summation and inserting the appropriate solutions, we obtain 

$$\begin{aligned}
f_{Y_1Y_2Y_3}(y_1, y_2, y_3) &= f_X(y_1) f_X(y_2) f_X(y_3) + f_X(y_1) f_X(y_3) f_X(y_2) + f_X(y_2) f_X(y_1) f_X(y_3) \\
&\quad + f_X(y_2) f_X(y_3) f_X(y_1) + f_X(y_3) f_X(y_1) f_X(y_2) + f_X(y_3) f_X(y_2) f_X(y_1) \\
&= 3!\, f_X(y_1) f_X(y_2) f_X(y_3).
\end{aligned}$$

This result applies when $y_1 < y_2 < y_3$; otherwise $f_{Y_1Y_2Y_3}(y_1, y_2, y_3) = 0$. 
We now summarize the result for the general case. We are given $n$ continuous i.i.d. RVs 
with pdf $f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$ with $-\infty < x_1, x_2, \ldots, x_n < \infty$ and consider 
the transformation that orders them by signed magnitude so that $Y_1 < Y_2 < \cdots < Y_n$, 
where for $i = 1, \ldots, n$, $Y_i \in \{X_1, X_2, \ldots, X_n\}$. Then 

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \begin{cases} n! \displaystyle\prod_{i=1}^{n} f_X(y_i), & \text{for } -\infty < y_1 < y_2 < \cdots < y_n < \infty \\ 0, & \text{else.} \end{cases} \tag{5.3-1}$$


Figure 5.3-1 Showing integration regions for two ordered random variables: $dy_2$ runs from $-\infty$ to $\infty$ and $dy_1$ from $-\infty$ to $y_2$. 


If $f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n)$ is a true pdf, it must integrate out to 1. This requires an $n$-fold iterated 
integration. 

To show how this integration is done we consider the $n = 2$ case. Then integrating 
the function $f_{Y_1Y_2}(y_1, y_2) = 2 f_X(y_1) f_X(y_2)$ over the region $-\infty < y_1 < y_2 < \infty$ requires 
integrating the integrand from $-\infty < y_1 < y_2$ followed by an integration from $-\infty < y_2 < 
\infty$. This is shown in Figure 5.3-1. Since $-\infty < y_1 < y_2$ we integrate the $y_1$ variable from $-\infty$ 
to $y_2$; then we complete the integration over the half-space by integrating the $y_2$ variable 
from $-\infty$ to $\infty$. 

The extension to the $n$-dimensional case is straightforward: We integrate the $y_1$ variable 
first from $-\infty$ to $y_2$; next the $y_2$ variable from $-\infty$ to $y_3$, etc.; finally the $y_n$ variable gets 
integrated from $-\infty$ to $\infty$. In this fashion we have integrated over the entire subspace 
$-\infty < y_1 < \cdots < y_n < \infty$. The last integration yields 

$$n! \int_{-\infty}^{\infty} \frac{F_X^{n-1}(y_n)}{(n-1)!}\, f_X(y_n)\, dy_n = \frac{n!}{(n-1)!} \int_{-\infty}^{\infty} F_X^{n-1}(y_n)\, dF_X(y_n) = \frac{n!}{(n-1)!}\, \frac{F_X^{n}(y_n)}{n}\bigg|_{-\infty}^{\infty} = 1.$$


—oco —oo 


The next development leads to the fundamental result of order statistics. 


Distribution of area random variables 


We begin by defining the area RVs 

$$Z_i \triangleq \int_{-\infty}^{Y_i} f_X(x)\, dx, \quad i = 1, \ldots, n, \tag{5.3-2}$$

where $f_X(x)$ is the pdf of a continuous RV $X$, and $X_1, \ldots, X_n$ are $n$ i.i.d. observations on 
$X$. After ordering we obtain the $Y_i$, $i = 1, \ldots, n$, as the ordered RVs where $\min(X_1, \ldots, X_n) \triangleq 
Y_1 < Y_2 < \cdots < Y_n \triangleq \max(X_1, \ldots, X_n)$. We denote $Z_i$ somewhat informally as an "area RV" 
because the RV $Z_i$ is the area under $f_X(x)$ up to $Y_i$. Clearly, because $Y_i$ is an RV so is $Z_i$. 
Indeed, we can think of $Z_i$ as a CDF with a random argument, hence we may also speak of 
it as a random CDF. We recognize that $Z_1 < \cdots < Z_n$ because $Y_1 < \cdots < Y_n$ and $Z_i$ is a 
monotonically increasing function of $Y_i$ for every index $i$. We consider the transformation 

$$z_i = \int_{-\infty}^{y_i} f_X(x)\, dx = F_X(y_i), \quad i = 1, \ldots, n,$$

where $F_X(x)$ is a continuously increasing function of $x$, and hence has a unique inverse at 
every $x$. The roots of these equations are $y_i^{(r)} = F_X^{-1}(z_i)$, $i = 1, \ldots, n$ (see Figure 5.3-2), and 
the Jacobian is 

$$J = \begin{vmatrix} \dfrac{\partial z_1}{\partial y_1} & \cdots & \dfrac{\partial z_1}{\partial y_n} \\ \vdots & & \vdots \\ \dfrac{\partial z_n}{\partial y_1} & \cdots & \dfrac{\partial z_n}{\partial y_n} \end{vmatrix} = \begin{vmatrix} f_X(y_1^{(r)}) & 0 & \cdots & 0 \\ 0 & f_X(y_2^{(r)}) & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & f_X(y_n^{(r)}) \end{vmatrix} = \prod_{i=1}^{n} f_X(y_i^{(r)}). \tag{5.3-3}$$

Hence the pdf of the $Z_i$, $i = 1, \ldots, n$, is determined as 

$$f_{Z_1 \cdots Z_n}(z_1, \ldots, z_n) = \begin{cases} n!\, \dfrac{\prod_{i=1}^{n} f_X(y_i^{(r)})}{\prod_{i=1}^{n} f_X(y_i^{(r)})} = n!, & 0 < z_1 < z_2 < \cdots < z_n < 1 \\ 0, & \text{else.} \end{cases} \tag{5.3-4}$$


Figure 5.3-2 Finding the roots of the transformation $y = F_X^{-1}(z)$. 


This non-intuitive result says that the pdf of the $Z_i$, $i = 1, \ldots, n$, does not depend on the 
underlying pdf $f_X(x)$. Equation 5.3-4 enables us to derive a number of important results 
useful in estimating various parameters when we don't know the underlying distributions. 
See, for instance, Example 5.3-5. 
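
The distribution-free character of the area RVs can be illustrated by simulation. In the sketch below, which is an illustration only, samples are drawn from two quite different parent distributions (a unit-rate exponential and a standard Gaussian, both arbitrary choices), each sample is sorted and mapped through its own parent CDF to obtain $Z_1, \ldots, Z_n$, and the probability that the smallest area $Z_1$ falls below a threshold is estimated for each parent. If the joint pdf of the $Z_i$ does not depend on $f_X(x)$, the two estimates should agree to within sampling error. All names and parameter values are hypothetical.

```python
import math
import random

def exp_cdf(x):
    """CDF of a unit-rate exponential RV."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def normal_cdf(x):
    """CDF of a standard Gaussian RV."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def area_rvs(sample, cdf):
    """Order the sample and map each ordered value through the parent CDF."""
    return [cdf(y) for y in sorted(sample)]

def prob_smallest_area_below(draw, cdf, n, threshold, trials=40000, seed=2):
    """Estimate P[Z_1 <= threshold] for samples of size n from the parent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = area_rvs([draw(rng) for _ in range(n)], cdf)
        if z[0] <= threshold:
            hits += 1
    return hits / trials

p_exp = prob_smallest_area_below(lambda r: r.expovariate(1.0), exp_cdf, 5, 0.1)
p_norm = prob_smallest_area_below(lambda r: r.gauss(0.0, 1.0), normal_cdf, 5, 0.1)
print(p_exp, p_norm)  # should agree to within sampling error
```

Any other event defined in terms of the $Z_i$ could be substituted for the one tested here.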


Example 5.3-1 
(area under $f_X(x)$ between the smallest and largest observations) We wish to compute 
the area under $f_X(x)$ between the smallest, $Y_1 = \min(X_1, \ldots, X_n)$, and largest, $Y_n = 
\max(X_1, \ldots, X_n)$, of the observations in a sample of size $n$ drawn from the pdf $f_X(x)$. We 
denote this area with the new random variable 

$$V_{1n} \triangleq \int_{Y_1}^{Y_n} f_X(x)\, dx. \tag{5.3-5}$$

We note that 

$$V_{1n} = \int_{-\infty}^{Y_n} f_X(x)\, dx - \int_{-\infty}^{Y_1} f_X(x)\, dx = Z_n - Z_1 \tag{5.3-6}$$

hence we need to compute $f_{Z_1Z_n}(z_1, z_n)$ from $f_{Z_1 \cdots Z_n}(z_1, \ldots, z_n)$. This requires integrating 
Equation 5.3-4 over $z_2, z_3, \ldots, z_{n-1}$, recalling that $z_{i-1} < z_i < 1$. The result is 

$$f_{Z_1Z_n}(z_1, z_n) = \begin{cases} n(n-1)(z_n - z_1)^{n-2}, & \text{for } 0 < z_1 < z_n < 1,\; n \ge 2 \\ 0, & \text{else.} \end{cases} \tag{5.3-7}$$

Consider now two new RVs $V_{1n} \triangleq Z_n - Z_1$, $W \triangleq Z_n$. To find the pdf $f_{V_{1n}W}(v, w)$, we 
consider the transformation $v = z_n - z_1$, $w = z_n$; $0 < v < w < 1$. The Jacobian magnitude of 
this transformation is 1 and the only solution to this transformation is $z_1^{(r)} = w - v$; $z_n^{(r)} = w$. 
Hence 

$$f_{V_{1n}W}(v, w) = \begin{cases} n(n-1)v^{n-2}, & \text{for } 0 < w - v < w < 1,\; n \ge 2 \\ 0, & \text{else.} \end{cases}$$

To get the pdf of $V_{1n}$ alone, we integrate out with respect to $w$. To help with the integration, 
we note that the two inequalities $w - v > 0$ and $w < 1$ suggest the triangular region of 
integration shown in Figure 5.3-3. 


Figure 5.3-3 Region of integration for computing the probability density function of $V_{1n}$. 


Figure 5.3-4 The beta CDF (Equation 5.3-9) for $n = 2$ (top curve); $n = 4$ (middle curve); $n = 10$ 
(bottom curve). 


Thus starting with $f_{V_{1n}}(v) = n(n-1)v^{n-2} \int_{v}^{1} dw$ we obtain 

$$f_{V_{1n}}(v) = \begin{cases} n(n-1)v^{n-2}(1 - v), & \text{for } 0 < v < 1,\; n \ge 2 \\ 0, & \text{else.} \end{cases} \tag{5.3-8}$$

This pdf is a special case of the beta density given in Section 2.4 with $\alpha = n - 2$, $\beta = 1$. 
The distribution function is the probability that the area spread between the largest and 
the smallest is less than or equal to $v$. It is readily computed as 

$$F_{V_{1n}}(v) = \begin{cases} 0, & v < 0 \\ n v^{n-1} - (n-1)v^{n}, & 0 \le v \le 1 \\ 1, & v > 1. \end{cases} \tag{5.3-9}$$

The beta CDF is shown in Figure 5.3-4 for various values of $n$. 
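
The CDF of the area spread can be checked by simulation. The sketch below is illustrative: it uses the closed form $n v^{n-1} - (n-1)v^{n}$ obtained by integrating the pdf in Equation 5.3-8 from 0 to $v$, and it exploits the distribution-free property by simulating the area RVs directly as ordered uniform(0,1) samples, which is an assumption of convenience consistent with Equation 5.3-4. Function names and parameter values are hypothetical.

```python
import random

def spread_cdf(v, n):
    """P[V_1n <= v]: integral of n(n-1)v^(n-2)(1-v) from 0 to v."""
    if v < 0.0:
        return 0.0
    if v > 1.0:
        return 1.0
    return n * v ** (n - 1) - (n - 1) * v ** n

def spread_cdf_monte_carlo(v, n, trials=50000, seed=3):
    """Estimate P[Z_n - Z_1 <= v] using n uniform(0,1) observations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u = [rng.random() for _ in range(n)]
        if max(u) - min(u) <= v:
            hits += 1
    return hits / trials

for n in (2, 4, 10):
    print(n, spread_cdf(0.6, n), spread_cdf_monte_carlo(0.6, n))
```

For a fixed $v$ the CDF decreases as $n$ grows, in line with the ordering of the curves described in the caption of Figure 5.3-4.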


Example 5.3-2 
(area between any ordered RVs) We can extend the above results to computing the density 
of the areas under $f_X(x)$ between any ordered RVs, not necessarily between the first and 
the last. We generalize the notation slightly so that 

$$V_{lm} \triangleq Z_m - Z_l = \int_{-\infty}^{Y_m} f_X(x)\, dx - \int_{-\infty}^{Y_l} f_X(x)\, dx = \int_{Y_l}^{Y_m} f_X(x)\, dx, \quad m > l. \tag{5.3-10}$$

Consider $0 < Z_1 < Z_2 < Z_3 < 1$ with $f_{Z_1Z_2Z_3}(z_1, z_2, z_3) = 3!$, $0 < z_1 < z_2 < z_3 < 1$. We 
first consider the density, $f_{V_{23}}(v)$, of the RV $V_{23} = Z_3 - Z_2$. Since this involves only $Z_2$ and 
$Z_3$, we must compute $f_{Z_2Z_3}(z_2, z_3)$ from $f_{Z_1Z_2Z_3}(z_1, z_2, z_3)$. This is done as 

$$f_{Z_2Z_3}(z_2, z_3) = \begin{cases} 3! \displaystyle\int_{0}^{z_2} dz_1 = 3!\, z_2, & \text{for } 0 < z_2 < z_3 < 1 \\ 0, & \text{else.} \end{cases}$$

To compute $f_{V_{23}}(v)$ from $f_{Z_2Z_3}(z_2, z_3)$, we define an auxiliary RV $B \triangleq Z_2$ with realizations 
$\beta$ and an appropriate set of functional equations. In this case a suitable set of functional 
equations is $v \triangleq z_3 - z_2$, $\beta \triangleq z_2$ with roots $z_2^{(r)} = \beta$, $z_3^{(r)} = v + \beta$. The reader will recognize 
that $\beta \triangleq z_2$ serves as the auxiliary variable. Then, using the so-called direct formula yields 

$$f_{BV_{23}}(\beta, v) = f_{Z_2Z_3}(\beta, v + \beta)/|J| = \begin{cases} 3!\,\beta, & \text{for } 0 < \beta < 1 - v \\ 0, & \text{else} \end{cases}$$

as the Jacobian magnitude $|J|$ of the transformation is unity. 
Finally, integrating over the auxiliary variable $\beta$ yields 

$$f_{V_{23}}(v) = 3! \int_{0}^{1-v} \beta\, d\beta = \frac{3!\,(1 - v)^2}{2!}, \quad 0 < v < 1. \tag{5.3-11}$$

To compute $f_{V_{12}}(v)$ from $f_{Z_1Z_2Z_3}(z_1, z_2, z_3)$, we proceed in the same fashion. Here we find 
that $f_{Z_1Z_2}(z_1, z_2)$ is given by $f_{Z_1Z_2}(z_1, z_2) = 3!\,(1 - z_2)$ for $0 < z_1 < z_2 < 1$ and 0 else. 
Then, using the transformation $v \triangleq z_2 - z_1$, $\beta \triangleq z_1$, we get the result 

$$f_{V_{12}}(v) = 3! \int_{0}^{1-v} (1 - v - \beta)\, d\beta = \frac{3!\,(1 - v)^2}{2!}, \quad 0 < v < 1. \tag{5.3-12}$$

We leave the details to the reader. 

The general case is given by the following: let $V_{lm}$ denote the probability area under 
$f_X(x)$ between $Y_l$ and $Y_m$, of the samples ordered by size $Y_1, Y_2, \ldots, Y_n$ drawn from the pdf 
$f_X(x)$. Then the pdf of $V_{lm}$ is given by 

$$f_{V_{lm}}(v) = \begin{cases} \dfrac{n!}{(m - l - 1)!\,(n - m + l)!}\, v^{m-l-1} (1 - v)^{n-m+l}, & 0 < v < 1 \\ 0, & \text{else.} \end{cases} \tag{5.3-13}$$
Example 5.3-3 
(expected value of area under $f_X(x)$ between ordered samples) Consider the area RVs 
$0 < Z_1 < Z_2 < Z_3 < 1$, where $Z_i$ is given in Equation 5.3-2. We wish to compute $E[Z_i]$ 
for $i = 1, 2, 3$, expecting that $E[Z_1] < E[Z_2] < E[Z_3]$. We find that the marginal pdf's 
$f_{Z_i}(z_i)$, $i = 1, 2, 3$, are computed as 

$$\begin{aligned}
f_{Z_1}(z_1) &= \int_{z_1}^{1} \int_{z_1}^{z_3} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\, dz_2\, dz_3 = 3(1 - z_1)^2 \\
f_{Z_2}(z_2) &= \int_{z_2}^{1} \int_{0}^{z_2} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\, dz_1\, dz_3 = 3!\, z_2 (1 - z_2) \\
f_{Z_3}(z_3) &= \int_{0}^{z_3} \int_{0}^{z_2} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\, dz_1\, dz_2 = 3 z_3^2.
\end{aligned}$$

From these results it follows that 

$$\begin{aligned}
E[Z_1] &= \int_{0}^{1} z f_{Z_1}(z)\, dz = 1/4 \\
E[Z_2] &= \int_{0}^{1} z f_{Z_2}(z)\, dz = 2/4 \\
E[Z_3] &= \int_{0}^{1} z f_{Z_3}(z)\, dz = 3/4
\end{aligned}$$

which suggests that in the general case, that is, $0 < Z_1 < Z_2 < \cdots < Z_n < 1$, 

$$E[Z_i] = \frac{i}{n + 1}. \tag{5.3-14}$$

The general case can be obtained by induction. 


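Example 5.3-4 
(moments of area between ordered samples) Consider the area between two adjacent ordered 
samples. This is given by $V_{i,i+1} = Z_{i+1} - Z_i$. The pdf of $V_{i,i+1}$ is given by Equation 5.3-13 
by letting $m = i + 1$, $l = i$, which yields $f_{V_{i,i+1}}(v) = n(1 - v)^{n-1}$, for $0 < v < 1$ and 0, else. 
Note that this result is independent of $i$. From this we compute 

$$E[V_{i,i+1}] = n \int_{0}^{1} v (1 - v)^{n-1}\, dv = n\, \frac{\Gamma(2)\Gamma(n)}{\Gamma(n + 2)} = n\, \frac{1!\,(n - 1)!}{(n + 1)!} = \frac{1}{n + 1},$$

where the gamma function $\Gamma(n) = (n - 1)!$ for $n = 1, 2, \ldots$ and use was made of tables of 
integrals (see for example formula 497, p. 67 in A Short Table of Integrals by B. O. Peirce 
and R. M. Foster, Ginn and Company, New York, 1956). The integral can also be found 
online at several places, including www.wolframalpha.com (type 'integral' at the prompt). 
Likewise 

$$E[V_{i,i+1}^2] = n \int_{0}^{1} v^2 (1 - v)^{n-1}\, dv = n\, \frac{\Gamma(3)\Gamma(n)}{\Gamma(n + 3)} = n\, \frac{2!\,(n - 1)!}{(n + 2)!} = \frac{2}{(n + 1)(n + 2)}.$$

Hence 

$$\sigma_{V_{i,i+1}}^2 = \frac{2}{(n + 1)(n + 2)} - \frac{1}{(n + 1)^2} \simeq \frac{1}{n^2} \quad \text{for } n \gg 1.$$

To compute the variances $\sigma_{Z_i}^2$, $i = 1, \ldots, n$, we first compute $E[Z_i^2]$ for $i = 1, \ldots, n$ as 

$$\begin{aligned}
E[Z_1^2] &= n! \int_{0}^{1} \int_{0}^{z_n} \cdots \int_{0}^{z_2} z_1^2\, dz_1 \cdots dz_{n-1}\, dz_n = \frac{2}{(n + 2)(n + 1)} \\
E[Z_2^2] &= n! \int_{0}^{1} \int_{0}^{z_n} \cdots \int_{0}^{z_3} z_2^2 \int_{0}^{z_2} dz_1\, dz_2 \cdots dz_n = \frac{6}{(n + 2)(n + 1)} \\
&\;\;\vdots \\
E[Z_n^2] &= n! \int_{0}^{1} z_n^2 \int_{0}^{z_n} \cdots \int_{0}^{z_2} dz_1 \cdots dz_{n-1}\, dz_n = \frac{n(n + 1)}{(n + 2)(n + 1)}.
\end{aligned}$$

It follows that 

$$E[Z_i^2] = \frac{i(i + 1)}{(n + 1)(n + 2)}, \quad \text{for } i = 1, \ldots, n,$$

and the variances, computed as $\sigma_{Z_i}^2 = E[Z_i^2] - E^2[Z_i]$, yield 

$$\sigma_{Z_i}^2 = \frac{i(n + 1 - i)}{(n + 1)^2 (n + 2)} \simeq \frac{i}{(n + 1)^2}, \quad \text{for } n \gg 1 \text{ and } i \ll n.$$

Thus for large $n$, $\sigma_{Z_i}^2 \simeq E[Z_i]/(n + 1)$. 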
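
The first two moments of the area RVs can be checked by simulation. The sketch below is illustrative only: it takes $E[Z_i] = i/(n+1)$ from Equation 5.3-14 and the exact variance $i(n+1-i)/[(n+1)^2(n+2)]$ that follows from $E[Z_i^2] - E^2[Z_i]$, and compares both with sample moments. Because the joint pdf of the $Z_i$ is distribution free, the $Z_i$ are simulated directly as ordered uniform(0,1) samples, which is an assumption of convenience. The function names, sample size, and seed are hypothetical.

```python
import random

def area_mean(i, n):
    """E[Z_i] for the i-th of n ordered area RVs (1-based index)."""
    return i / (n + 1)

def area_variance(i, n):
    """Var[Z_i] = E[Z_i^2] - E[Z_i]^2 with E[Z_i^2] = i(i+1)/((n+1)(n+2))."""
    return i * (n + 1 - i) / ((n + 1) ** 2 * (n + 2))

def area_moments_monte_carlo(n, trials=40000, seed=11):
    """Sample means and variances of Z_1..Z_n from ordered uniform samples."""
    rng = random.Random(seed)
    s1 = [0.0] * n
    s2 = [0.0] * n
    for _ in range(trials):
        z = sorted(rng.random() for _ in range(n))
        for i, zi in enumerate(z):
            s1[i] += zi
            s2[i] += zi * zi
    means = [s / trials for s in s1]
    variances = [s2[i] / trials - means[i] ** 2 for i in range(n)]
    return means, variances

sim_means, sim_vars = area_moments_monte_carlo(5)
for i in range(1, 6):
    print(i, area_mean(i, 5), sim_means[i - 1], area_variance(i, 5), sim_vars[i - 1])
```

The variance is largest for the middle index and symmetric about it, so the large-$n$ approximation $E[Z_i]/(n+1)$ is best read as applying to the lower-order samples.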
Example 5.3-5 
(Estimating range of boot sizes) Military boots need to be ordered for fresh Army recruits 
but the manufacturer needs to know what range of boot sizes will be required. It is suggested 
that a random sample of $n$ recruits be measured for required boot size. What is the minimum 
value of $n$ that will cover at least 95 percent of the boot-size needs of the recruits? 


Solution Let $\{X_i, i = 1, \ldots, n\}$ denote the i.i.d. set of boot sizes of the $n$ recruits drawn 
from a population with (unknown) pdf $f_X(x)$ and let $\{Y_i, i = 1, \ldots, n\}$ denote the order 
statistics of the observations. With $V_{1n} \triangleq \int_{Y_1}^{Y_n} f_X(x)\, dx$ we need to solve $P[V_{1n} \ge 0.95] = \delta$, 
where $\delta$ is a measure of the reliability of our estimate of $n$, that is, in $100\delta$ percent of the 
time, the number $n$ will indeed be the minimum sample size required for estimating the 
boot needs of the recruits. Using $P[V_{1n} \le 0.95] = 1 - \delta$ and Equation 5.3-9 we compute 
$n = 93$ for $\delta = 0.95$ and $n = 114$ for $\delta = 0.98$. The solution is obtained numerically using 
Excel™. Note that the result is independent of the size of the recruit army. 
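
The numerical search described in the solution can be reproduced in a few lines. The sketch below is a minimal illustration, assuming the CDF of the area spread has the form $n v^{n-1} - (n-1)v^{n}$ on $0 \le v \le 1$ (the integral of the beta pdf of Equation 5.3-8) and assuming the criterion is the smallest $n$ for which $P[V_{1n} \ge 0.95]$ reaches the target reliability $\delta$. The function names and the search limit are hypothetical.

```python
def coverage_reliability(n, coverage=0.95):
    """P[V_1n >= coverage] = 1 - F_V(coverage) for a sample of size n."""
    v = coverage
    return 1.0 - (n * v ** (n - 1) - (n - 1) * v ** n)

def minimum_sample_size(reliability, coverage=0.95, n_max=10000):
    """Smallest n with P[V_1n >= coverage] >= reliability (linear search)."""
    for n in range(2, n_max + 1):
        if coverage_reliability(n, coverage) >= reliability:
            return n
    raise ValueError("no sample size up to n_max meets the requirement")

print(minimum_sample_size(0.95))
print(coverage_reliability(114))
```

For a reliability of 0.95 the search reproduces the sample size quoted in the solution. For 0.98 the reliability at $n = 114$ equals 0.98 only after rounding to two decimal places, so a strict threshold test may return a neighboring value of $n$.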


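5.4 EXPECTATION VECTORS AND COVARIANCE MATRICES†

Definition 5.4-1 The expected value of the (column) vector $\mathbf{X} = (X_1,\ldots,X_n)^T$ is
a vector $\boldsymbol{\mu}$ (or $\overline{\mathbf{X}}$) whose elements $\mu_1,\ldots,\mu_n$ are given by

$$\mu_i \triangleq \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} x_i f_{\mathbf{X}}(x_1,\ldots,x_n)\,dx_1\cdots dx_n. \tag{5.4-1}$$

Equivalently with

$$f_{X_i}(x_i) \triangleq \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,dx_1\cdots dx_{i-1}\,dx_{i+1}\cdots dx_n$$

the marginal pdf of $X_i$, we can write

$$\mu_i = \int_{-\infty}^{\infty} x_i f_{X_i}(x_i)\,dx_i, \quad i = 1,\ldots,n.$$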


Definition 5.4-2 The covariance matrix $\mathbf{K}$ associated with a real random vector $\mathbf{X}$
is the expected value of the outer vector product $(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T$, that is,‡

$$\mathbf{K} \triangleq E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]. \tag{5.4-2}$$

We have for the $(i, j)$th component

$$K_{ij} \triangleq E[(X_i - \mu_i)(X_j - \mu_j)], \quad i, j = 1,\ldots,n. \tag{5.4-3}$$

†This section requires some familiarity with matrix theory.
‡We temporarily dispense with adding identifying subscripts on the mean, covariance and other vector
parameters since it is clear we are dealing only with the RV X.

In particular with $\sigma_i^2 \triangleq K_{ii}$, we can write $\mathbf{K}$ in expanded form as

$$\mathbf{K} = \begin{bmatrix} \sigma_1^2 & \cdots & K_{1n} \\ \vdots & \ddots & \vdots \\ K_{n1} & \cdots & \sigma_n^2 \end{bmatrix}. \tag{5.4-4}$$

If $\mathbf{X}$ is real, all the elements of $\mathbf{K}$ are real. Also since $K_{ij} = K_{ji}$, real-valued covariance
matrices fall within the class of matrices called real symmetric (r.s.). Such matrices
fall within the larger class of Hermitian matrices.† Real symmetric matrices have many
interesting properties, several of which we shall discuss in the next section.

The diagonal elements $\sigma_i^2$ are the variances associated with the individual RVs $X_i$
for $i = 1,\ldots,n$. The covariance matrix $\mathbf{K}$ is closely related to the correlation matrix $\mathbf{R}$
defined by

$$\mathbf{R} \triangleq E[\mathbf{X}\mathbf{X}^T]. \tag{5.4-5}$$

Indeed expanding Equation 5.4-2 yields

$$\mathbf{K} = \mathbf{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T$$

or

$$\mathbf{R} = \mathbf{K} + \boldsymbol{\mu}\boldsymbol{\mu}^T. \tag{5.4-6}$$

The correlation matrix $\mathbf{R}$ is also real symmetric for a real-valued random vector and is
sometimes called the autocorrelation matrix. Random vectors are often classified according
to whether they are uncorrelated, orthogonal, or independent.
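
The relation $\mathbf{R} = \mathbf{K} + \boldsymbol{\mu}\boldsymbol{\mu}^T$ can be checked exactly on a small discrete random vector. The joint PMF below is a made-up illustration (it does not come from the text), and exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint PMF of a 2-component discrete random vector (illustration only).
pmf = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 1): F(1, 2)}
n = 2


def expect(g):
    """Expected value of g(x) under the joint PMF."""
    return sum(p * g(x) for x, p in pmf.items())


# Mean vector, correlation matrix R = E[X X^T], covariance matrix K.
mu = [expect(lambda x, i=i: x[i]) for i in range(n)]
R = [[expect(lambda x, i=i, j=j: x[i] * x[j]) for j in range(n)] for i in range(n)]
K = [[expect(lambda x, i=i, j=j: (x[i] - mu[i]) * (x[j] - mu[j])) for j in range(n)]
     for i in range(n)]

# Check R = K + mu mu^T element by element.
identity_holds = all(R[i][j] == K[i][j] + mu[i] * mu[j]
                     for i in range(n) for j in range(n))
print(mu, K, identity_holds)
```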


Definition 5.4-3 Consider two real n-dimensional random vectors $\mathbf{X}$ and $\mathbf{Y}$ with
respective mean vectors $\boldsymbol{\mu}_X$ and $\boldsymbol{\mu}_Y$. Then if the expected value of their outer product
satisfies

$$E\{\mathbf{X}\mathbf{Y}^T\} = \boldsymbol{\mu}_X\boldsymbol{\mu}_Y^T, \tag{5.4-7}$$

$\mathbf{X}$ and $\mathbf{Y}$ are said to be uncorrelated. If

$$E\{\mathbf{X}\mathbf{Y}^T\} = \mathbf{0} \quad \text{(an } n \times n \text{ matrix of all zeros)}, \tag{5.4-8}$$

$\mathbf{X}$ and $\mathbf{Y}$ are said to be orthogonal.

†The class of $n \times n$ matrices for which $K_{ij} = K_{ji}^*$. For a thorough discussion of the properties of such
matrices see [5-2]. When X is complex, the covariance is generally not r.s. but is Hermitian.

Note that in the orthogonal case $E\{X_iY_j\} = 0$ for all $1 \le i, j \le n$. Thus, the expected
value of the inner product is zero, that is, $E[\mathbf{X}^T\mathbf{Y}] = 0$, which reminds us of the meaning
of orthogonality for two ordinary (nonrandom) vectors, that is, $\mathbf{x}^T\mathbf{y} = 0$.

Finally if 


$$f_{\mathbf{XY}}(\mathbf{x},\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})f_{\mathbf{Y}}(\mathbf{y}), \tag{5.4-9}$$


X and Y are said to be independent. 


Independence always implies uncorrelatedness but the converse is not generally true. An 
exception is the multidimensional Gaussian pdf to be presented in Section 5.6. It is often 
difficult, in practice, to show that two random vectors are independent. However, statistical 
tests exist to determine, within prescribed confidence levels, the extent to which they are 
correlated. 


Example 5.4-1
(almost independent RVs) Consider two RVs $X_1$ and $X_2$ with joint pdf $f_{X_1X_2}(x_1,x_2) =
x_1 + x_2$ for $0 < x_1 \le 1$, $0 < x_2 \le 1$, and zero elsewhere. We find that while $X_1$ and $X_2$ are
not independent, they are essentially uncorrelated. To demonstrate this, we shall compute
$E[(X_1 - \mu_1)(X_2 - \mu_2)]$ as

$$K_{12} = K_{21} = R_{12} - \mu_1\mu_2.$$

We first compute

$$\mu_1 = \mu_2 = \iint_S x(x + y)\,dx\,dy = 0.583,$$

where $S = \{(x_1,x_2): 0 < x_1 \le 1,\ 0 < x_2 \le 1\}$.
Next we compute the correlation products

$$R_{12} = R_{21} = \iint_S xy(x + y)\,dx\,dy = 0.333.$$

Hence $K_{12} = K_{21} = 0.333 - (0.583)^2 = -0.007$. Also we compute

$$\sigma_1^2 = \sigma_2^2 = \int_0^1 x^2\left(x + \tfrac{1}{2}\right)dx - (0.583)^2 = 0.4167 - 0.34 = 0.077.$$

Hence the correlation coefficient (normalized covariance) is computed to be $\rho = K_{12}/\sigma_1\sigma_2 =
-0.091$. For the purpose of predicting $X_2$ by observing $X_1$, or vice versa, one may consider
these RVs as being uncorrelated. Indeed the prediction error $\varepsilon$ in Equation 4.3-22 from
Example 4.3-4 is 0.076. Were $X_1$, $X_2$ truly uncorrelated, the prediction error would have
been 0.077. The covariance matrix $\mathbf{K}$ for this case is

$$\mathbf{K} = \begin{bmatrix} 0.077 & -0.007 \\ -0.007 & 0.077 \end{bmatrix} = 0.077\begin{bmatrix} 1 & -0.09 \\ -0.09 & 1 \end{bmatrix}.$$
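
The moments in Example 5.4-1 can be reproduced exactly. The sketch below uses the closed form $\int_0^1\!\int_0^1 x_1^a x_2^b (x_1 + x_2)\,dx_1\,dx_2 = \frac{1}{(a+2)(b+1)} + \frac{1}{(a+1)(b+2)}$, which follows from integrating each term separately; the helper name `moment` is illustrative.

```python
from fractions import Fraction as F


def moment(a: int, b: int) -> F:
    """E[X1^a X2^b] for the joint pdf f(x1, x2) = x1 + x2 on the unit square."""
    return F(1, (a + 2) * (b + 1)) + F(1, (a + 1) * (b + 2))


mu = moment(1, 0)             # mean of X1 (and, by symmetry, of X2)
r12 = moment(1, 1)            # correlation E[X1 X2]
var = moment(2, 0) - mu ** 2  # variance of X1 (and of X2)
k12 = r12 - mu ** 2           # covariance K12 = R12 - mu1 * mu2
rho = k12 / var               # correlation coefficient

print(float(mu), float(r12), float(var), float(k12), float(rho))
```

The exact values are $\mu = 7/12$, $R_{12} = 1/3$, $\sigma^2 = 11/144$, $K_{12} = -1/144$, and $\rho = -1/11$, which agree with the rounded figures in the example to within the rounding used there.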
5.5 PROPERTIES OF COVARIANCE MATRICES 


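Since covariance matrices are r.s., we study some of the properties of such matrices. Let $\mathbf{M}$
be any $n \times n$ r.s. matrix. The quadratic form associated with $\mathbf{M}$ is the scalar $q(\mathbf{z})$ defined by

$$q(\mathbf{z}) \triangleq \mathbf{z}^T\mathbf{M}\mathbf{z}, \tag{5.5-1}$$

where $\mathbf{z}$ is any column vector. A matrix $\mathbf{M}$ is said to be positive semidefinite (p.s.d.) if

$$\mathbf{z}^T\mathbf{M}\mathbf{z} \ge 0$$

for all $\mathbf{z}$. If the inequality is strict, i.e., $\mathbf{z}^T\mathbf{M}\mathbf{z} > 0$ for all $\mathbf{z} \ne \mathbf{0}$, $\mathbf{M}$ is said to be positive
definite (p.d.). A covariance matrix $\mathbf{K}$ is always (at least) p.s.d. since for any vector $\mathbf{z} \triangleq
(z_1,\ldots,z_n)^T$

$$0 \le E\{[\mathbf{z}^T(\mathbf{X} - \boldsymbol{\mu})]^2\} = \mathbf{z}^T E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]\,\mathbf{z} = \mathbf{z}^T\mathbf{K}\mathbf{z}. \tag{5.5-2}$$

We shall show later that when $\mathbf{K}$ is full-rank, then $\mathbf{K}$ is p.d.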


We now state some definitions and theorems (most without proof) from linear algebra
[5-2, Chapter 4] that we shall need for developing useful operations on covariance matrices.


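Definition 5.5-1 The eigenvalues of an $n \times n$ matrix $\mathbf{M}$ are those numbers $\lambda$ for
which the characteristic equation $\mathbf{M}\boldsymbol{\phi} = \lambda\boldsymbol{\phi}$ has a solution $\boldsymbol{\phi} \ne \mathbf{0}$. The column vector
$\boldsymbol{\phi} = (\phi_1, \phi_2,\ldots,\phi_n)^T$ is called an eigenvector.

Eigenvectors are often normalized so that $\boldsymbol{\phi}^T\boldsymbol{\phi} \triangleq \|\boldsymbol{\phi}\|^2 = 1$.

Theorem 5.5-1 The number $\lambda$ is an eigenvalue of the square matrix $\mathbf{M}$ if and only
if $\det(\mathbf{M} - \lambda\mathbf{I}) = 0$.†

Example 5.5-1
(eigenvalues) Consider the matrix

$$\mathbf{M} = \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}.$$

The eigenvalues are obtained with the help of Theorem 5.5-1, that is,

$$\det\begin{bmatrix} 4 - \lambda & 2 \\ 2 & 4 - \lambda \end{bmatrix} = (4 - \lambda)^2 - 4 = 0,$$

whence

$$\lambda_1 = 6, \quad \lambda_2 = 2.$$

†det is short for determinant and $\mathbf{I}$ is the identity matrix.

The (normalized) eigenvector associated with $\lambda_1 = 6$ is obtained from

$$(\mathbf{M} - 6\mathbf{I})\boldsymbol{\phi} = \mathbf{0},$$

which, written out as a system of equations, yields

$$\left.\begin{aligned} -2\phi_1 + 2\phi_2 &= 0 \\ 2\phi_1 - 2\phi_2 &= 0 \end{aligned}\right\} \Rightarrow \boldsymbol{\phi}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

The double arrow $\Rightarrow$ means "implies that." The eigenvector associated with $\lambda_2 = 2$,
following the same procedure as above, is found from

$$\left.\begin{aligned} 2\phi_1 + 2\phi_2 &= 0 \\ 2\phi_1 + 2\phi_2 &= 0 \end{aligned}\right\} \Rightarrow \boldsymbol{\phi}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

Not all $n \times n$ matrices have n distinct eigenvalues or n eigenvectors. Sometimes a matrix
can have fewer than n distinct eigenvalues but still have n distinct eigenvectors.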
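
For a $2 \times 2$ real symmetric matrix the steps of Example 5.5-1 have a closed form, sketched below. The function name is illustrative, and the sketch assumes a nonzero off-diagonal element so that the eigenvector formula is well defined.

```python
import math


def eig_sym_2x2(a: float, b: float, d: float):
    """Eigenvalues and unit eigenvectors of the r.s. matrix [[a, b], [b, d]].

    Assumes b != 0. Returns [(lambda_1, phi_1), (lambda_2, phi_2)] with
    lambda_1 >= lambda_2.
    """
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)  # roots of det(M - lambda I) = 0
    pairs = []
    for lam in (mean + radius, mean - radius):
        # First row of (M - lam I) phi = 0 gives phi proportional to (b, lam - a).
        v = (b, lam - a)
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


(lam1, phi1), (lam2, phi2) = eig_sym_2x2(4.0, 2.0, 4.0)
print(lam1, phi1)
print(lam2, phi2)
```

For the matrix of Example 5.5-1 this yields the eigenvalues 6 and 2 with unit eigenvectors proportional to $(1, 1)^T$ and $(1, -1)^T$.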


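Definition 5.5-2 Two $n \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ are called similar if there exists an
$n \times n$ invertible matrix $\mathbf{T}$, i.e., $\det\mathbf{T} \ne 0$, such that

$$\mathbf{T}^{-1}\mathbf{A}\mathbf{T} = \mathbf{B}. \tag{5.5-3}$$

Theorem 5.5-2 An $n \times n$ matrix $\mathbf{M}$ is similar to a diagonal matrix if and only if $\mathbf{M}$
has n linearly independent eigenvectors. ∎

Theorem 5.5-3 Let $\mathbf{M}$ be an r.s. matrix with eigenvalues $\lambda_1,\ldots,\lambda_n$. Then $\mathbf{M}$ has n
mutually orthogonal unit eigenvectors $\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_n$.† ∎

Discussion. Since $\mathbf{M}$ has n mutually orthogonal (and therefore independent) unit eigenvectors,
it is similar to some diagonal matrix $\boldsymbol{\Lambda}$ under a suitable transformation $\mathbf{T}$. What
are $\boldsymbol{\Lambda}$ and $\mathbf{T}$? The answer is furnished by the following important theorem.

Theorem 5.5-4 Let $\mathbf{M}$ be a real symmetric matrix with eigenvalues $\lambda_1,\ldots,\lambda_n$. Then
$\mathbf{M}$ is similar to the diagonal matrix $\boldsymbol{\Lambda}$ given by

$$\boldsymbol{\Lambda} \triangleq \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

under the transformation

$$\mathbf{U}^{-1}\mathbf{M}\mathbf{U} = \boldsymbol{\Lambda}, \tag{5.5-4}$$

where $\mathbf{U}$ is a matrix whose columns are the corresponding‡ orthogonal unit eigenvectors
$\boldsymbol{\phi}_i$, $i = 1,\ldots,n$, of $\mathbf{M}$. Thus,

$$\mathbf{U} = (\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_n). \tag{5.5-5}$$

Moreover, it can be shown that $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ (and that $\mathbf{U}^T = \mathbf{U}^{-1}$) so that Equation 5.5-4
can be written as

$$\mathbf{U}^T\mathbf{M}\mathbf{U} = \boldsymbol{\Lambda}. \tag{5.5-6}$$

∎

†Orthogonal eigenvectors $\boldsymbol{\phi}_i$ such that $\|\boldsymbol{\phi}_i\| = 1$ are said to be orthonormal.
‡That is, $\boldsymbol{\phi}_i$ goes with $\lambda_i$ for $i = 1,\ldots,n$.

Discussion. Matrices such as $\mathbf{U}$, which satisfy $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, are called unitary. They have the
property of distance preservation in the following sense: Consider a vector $\mathbf{x} = (x_1,\ldots,x_n)^T$.
The Euclidean distance of $\mathbf{x}$ from the origin is

$$\|\mathbf{x}\| \triangleq (\mathbf{x}^T\mathbf{x})^{1/2},$$

where $\|\mathbf{x}\|$ is called the norm of $\mathbf{x}$. Now consider the transformation $\mathbf{y} = \mathbf{U}\mathbf{x}$, where $\mathbf{U}$ is
unitary. Then

$$\|\mathbf{y}\|^2 = \mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{U}^T\mathbf{U}\mathbf{x} = \|\mathbf{x}\|^2.$$

Thus, the new vector $\mathbf{y}$ has the same distance from the origin as the old vector $\mathbf{x}$ under the
transformation $\mathbf{y} = \mathbf{U}\mathbf{x}$.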

Since a covariance matrix K of a real random vector is real symmetric, it can be readily 
diagonalized according to Equation 5.5-6 once U is known. The columns of U are just the 
normalized eigenvectors of K and these can be obtained once the eigenvalues are known. The 
diagonalization of covariance matrices is a very important procedure in applied probability 
theory. It is used to transform correlated RVs into uncorrelated RVs and, in the Normal 
case, it transforms correlated RVs into independent RVs. 


Example 5.5-2 
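(decorrelation of random vectors) A random vector $\mathbf{X} = (X_1,X_2,X_3)^T$ has covariance
matrix†

$$\mathbf{K}_{XX} = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix}.$$

Design an invertible linear transformation that will generate from $\mathbf{X}$ a new random vector $\mathbf{Y}$
whose components are uncorrelated.

Solution First we compute the eigenvalues by solving the equation $\det(\mathbf{K}_{XX} - \lambda\mathbf{I}) = 0$.
This yields $\lambda_1 = 2$, $\lambda_2 = 2 + \sqrt{2}$, $\lambda_3 = 2 - \sqrt{2}$. Next we compute the three orthogonal
eigenvectors by solving the equation $(\mathbf{K}_{XX} - \lambda_i\mathbf{I})\boldsymbol{\phi}_i = \mathbf{0}$, $i = 1,2,3$, and normalize these to
create eigenvectors of unit norm. Unit normalization is achieved by dividing each component
of the eigenvector by the norm of the eigenvector. This yields

$$\boldsymbol{\phi}_1 = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)^T, \quad
\boldsymbol{\phi}_2 = \left(\tfrac{1}{\sqrt{2}}, -\tfrac{1}{2}, \tfrac{1}{2}\right)^T, \quad
\boldsymbol{\phi}_3 = \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{2}, -\tfrac{1}{2}\right)^T.$$

†Here we add subscripts to K to help distinguish the covariance matrix of one random variable from
that of another.

Now we create the eigenvector matrix $\mathbf{U} = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \boldsymbol{\phi}_3]$ that, upon transposing, becomes an
appropriate transformation to make the components of $\mathbf{Y}$ uncorrelated. With

$$\mathbf{A} = \mathbf{U}^T = \begin{bmatrix} 0 & \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}$$

the transformation $\mathbf{Y} = \mathbf{A}\mathbf{X}$ yields the components

$$\begin{aligned}
Y_1 &= \tfrac{1}{\sqrt{2}}X_2 + \tfrac{1}{\sqrt{2}}X_3 \\
Y_2 &= \tfrac{1}{\sqrt{2}}X_1 - \tfrac{1}{2}X_2 + \tfrac{1}{2}X_3 \\
Y_3 &= \tfrac{1}{\sqrt{2}}X_1 + \tfrac{1}{2}X_2 - \tfrac{1}{2}X_3.
\end{aligned}$$

The covariance of $\mathbf{Y}$ is given by

$$\mathbf{K}_{YY} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 + \sqrt{2} & 0 \\ 0 & 0 & 2 - \sqrt{2} \end{bmatrix}.$$

Actually we could go one step further; by scaling the three components of $\mathbf{Y}$, separately,
we can make the variance (average AC power) the same in each scaled component. This
process is called whitening and is discussed in greater detail below. Clearly if $Y_1$ is scaled
proportional to $1/\sqrt{2}$, $Y_2$ is scaled proportional to $1/\sqrt{2 + \sqrt{2}}$, and $Y_3$ is scaled proportional to $1/\sqrt{2 - \sqrt{2}}$,
all three outputs will have the same power.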
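
The decorrelation in Example 5.5-2 can be verified numerically by forming $\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T$ with $\mathbf{A} = \mathbf{U}^T$. The sketch below uses plain Python lists; the helper names `matmul` and `transpose` are illustrative.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


s = 1.0 / math.sqrt(2.0)
K_xx = [[2.0, -1.0, 1.0],
        [-1.0, 2.0, 0.0],
        [1.0, 0.0, 2.0]]
# Columns of U are the unit eigenvectors phi_1, phi_2, phi_3 from the example.
U = [[0.0, s, s],
     [s, -0.5, 0.5],
     [s, 0.5, -0.5]]

A = transpose(U)
K_yy = matmul(matmul(A, K_xx), transpose(A))
for row in K_yy:
    print([round(v, 12) for v in row])
```

Up to floating-point rounding, the result is the diagonal matrix with entries $2$, $2 + \sqrt{2}$, and $2 - \sqrt{2}$, so the components of $\mathbf{Y}$ are uncorrelated.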

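If $\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_n$ are the orthogonal unit eigenvectors of a real symmetric matrix $\mathbf{M}$, then
the system of equations

$$\mathbf{M}\boldsymbol{\phi}_i = \lambda_i\boldsymbol{\phi}_i, \quad i = 1,\ldots,n,$$

can be compactly written as

$$\mathbf{M}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}. \tag{5.5-7}$$

The next theorem establishes a relation between the eigenvalues of an r.s. matrix and its
positive definite character.

Theorem 5.5-5 A real symmetric matrix $\mathbf{M}$ is positive definite if and only if all its
eigenvalues are positive.

Proof First let $\lambda_i > 0$, $i = 1,\ldots,n$. Then with the linear transformation $\mathbf{x} \triangleq \mathbf{U}\mathbf{y}$ we
can write for any vector $\mathbf{x}$

$$\mathbf{x}^T\mathbf{M}\mathbf{x} = (\mathbf{U}\mathbf{y})^T\mathbf{M}(\mathbf{U}\mathbf{y}) = \mathbf{y}^T\mathbf{U}^T\mathbf{M}\mathbf{U}\mathbf{y} = \mathbf{y}^T\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{n}\lambda_i y_i^2 > 0 \tag{5.5-8}$$

unless $\mathbf{y} = \mathbf{0}$. But if $\mathbf{y} = \mathbf{0}$, then from $\mathbf{x} = \mathbf{U}\mathbf{y}$, $\mathbf{x} = \mathbf{0}$ as well. Hence we have shown that $\mathbf{M}$
is p.d. if $\lambda_i > 0$ for all $i$. Conversely, we must show that if $\mathbf{M}$ is p.d., then all $\lambda_i > 0$. Thus,
for any $\mathbf{x} \ne \mathbf{0}$

$$0 < \mathbf{x}^T\mathbf{M}\mathbf{x}. \tag{5.5-9}$$

In particular, Equation 5.5-9 must hold for $\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_n$. But

$$0 < \boldsymbol{\phi}_i^T\mathbf{M}\boldsymbol{\phi}_i = \lambda_i, \quad i = 1,\ldots,n.$$

Hence $\lambda_i > 0$, $i = 1,\ldots,n$. Thus, a p.d. covariance matrix $\mathbf{K}$ will have all positive eigenvalues.
Also since its determinant $\det(\mathbf{K})$ is the product of its eigenvalues, $\det(\mathbf{K}) > 0$.
Conversely, a covariance matrix is always p.s.d., so its eigenvalues are nonnegative; when $\mathbf{K}$ is
full-rank, $\det(\mathbf{K}) \ne 0$, none of the eigenvalues is zero, and hence $\mathbf{K}$ is p.d.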


Whitening Transformation 


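We are given a zero-mean $n \times 1$ random vector $\mathbf{X}$ with positive definite covariance
matrix $\mathbf{K}_{XX}$ and wish to find a transformation $\mathbf{Y} = \mathbf{C}\mathbf{X}$ such that $\mathbf{K}_{YY} = \mathbf{I}$. The
matrix $\mathbf{C}$ is called a whitening transform and the process of going from $\mathbf{X}$ to $\mathbf{Y}$ is called a
whitening transformation. Let the n unit eigenvectors and eigenvalues of $\mathbf{K}_{XX}$ be denoted,
respectively, by $\boldsymbol{\phi}_i, \lambda_i$, $i = 1,\ldots,n$. Then the characteristic equation $\mathbf{K}_{XX}\boldsymbol{\phi}_i = \lambda_i\boldsymbol{\phi}_i$, $i =
1,\ldots,n$, can be compactly written as $\mathbf{K}_{XX}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}$, where $\mathbf{U} \triangleq [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \cdots\ \boldsymbol{\phi}_n]$ and $\boldsymbol{\Lambda} \triangleq
\mathrm{diag}(\lambda_1, \lambda_2,\ldots,\lambda_n)$. Since $\mathbf{K}_{XX}$ is p.d., all its eigenvalues are positive and the matrix
$\boldsymbol{\Lambda}^{-1/2} \triangleq \mathrm{diag}(1/\sqrt{\lambda_1}, 1/\sqrt{\lambda_2},\ldots,1/\sqrt{\lambda_n})$ exists and is well defined. Now consider the
transformation $\mathbf{Y} = \mathbf{C}\mathbf{X} \triangleq \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{X}$. Then

$$\begin{aligned}
\mathbf{K}_{YY} = E[\mathbf{Y}\mathbf{Y}^T] &= E[\mathbf{C}\mathbf{X}\mathbf{X}^T\mathbf{C}^T] = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T E[\mathbf{X}\mathbf{X}^T]\mathbf{U}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{K}_{XX}\mathbf{U}\boldsymbol{\Lambda}^{-1/2} \\
&= \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T(\mathbf{K}_{XX}\mathbf{U})\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T(\mathbf{U}\boldsymbol{\Lambda})\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}(\mathbf{U}^T\mathbf{U})\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} \\
&= \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}, \quad \text{since } \mathbf{U}^T\mathbf{U} = \mathbf{I}.
\end{aligned}$$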


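Example 5.5-3
(whitening transformation) In Example 5.5-2 we considered the random vector $\mathbf{X}$ with
covariance matrix and eigenvector matrices, respectively

$$\mathbf{K}_{XX} = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix}, \quad
\mathbf{U} = \begin{bmatrix} 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/2 & 1/2 \\ 1/\sqrt{2} & 1/2 & -1/2 \end{bmatrix} = \mathbf{U}^T$$

with

$$\boldsymbol{\Lambda} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 + \sqrt{2} & 0 \\ 0 & 0 & 2 - \sqrt{2} \end{bmatrix}.$$

Then

$$\mathbf{Y} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 \\ 0 & (2 + \sqrt{2})^{-1/2} & 0 \\ 0 & 0 & (2 - \sqrt{2})^{-1/2} \end{bmatrix}
\begin{bmatrix} 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/2 & 1/2 \\ 1/\sqrt{2} & 1/2 & -1/2 \end{bmatrix}\mathbf{X}$$

is the appropriate whitening transformation. Whitening transformations are especially useful
in the simultaneous diagonalization of two covariance matrices.†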
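
A whitening matrix $\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$ can be assembled directly from the eigenpairs and checked by confirming that $\mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T$ is the identity. The sketch below does this for the matrices of Example 5.5-3; the function name `whitening_matrix` and the list-based helpers are illustrative.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def whitening_matrix(U, eigenvalues):
    """C = Lambda^{-1/2} U^T: row i of U^T scaled by 1/sqrt(lambda_i)."""
    return [[u / math.sqrt(lam) for u in row]
            for row, lam in zip(transpose(U), eigenvalues)]


s = 1.0 / math.sqrt(2.0)
K_xx = [[2.0, -1.0, 1.0], [-1.0, 2.0, 0.0], [1.0, 0.0, 2.0]]
U = [[0.0, s, s], [s, -0.5, 0.5], [s, 0.5, -0.5]]
eigenvalues = [2.0, 2.0 + math.sqrt(2.0), 2.0 - math.sqrt(2.0)]

C = whitening_matrix(U, eigenvalues)
K_yy = matmul(matmul(C, K_xx), transpose(C))
for row in K_yy:
    print([round(v, 12) for v in row])
```

The printed matrix is the $3 \times 3$ identity up to rounding, so each whitened component has unit variance and the components are mutually uncorrelated.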


5.6 THE MULTIDIMENSIONAL GAUSSIAN (NORMAL) LAW 


The general n-dimensional Gaussian law has a rather forbidding mathematical appearance 
upon first acquaintance but is, fortunately, rather easily seen as an extension of the one- 
dimensional Gaussian pdf. Indeed we already introduced the two-dimensional Gaussian pdf 
in Section 4.3 but there we did not infer it from the general case. Here we consider the 
general case from which we shall be able to infer all special cases. We already know that if 
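$X$ is a (scalar) Gaussian RV with mean $\mu$ and variance $\sigma^2$, its pdf is

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right).$$

First, we consider a random vector $\mathbf{X} = (X_1,\ldots,X_n)^T$ with independent components $X_i$,
$i = 1,\ldots,n$, each distributed as $N(\mu_i,\sigma_i^2)$. Then the pdf of $\mathbf{X}$ is the product of the
individual pdf's of $X_1,\ldots,X_n$, that is,

$$f_{\mathbf{X}}(x_1,\ldots,x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)
= \frac{1}{(2\pi)^{n/2}\sigma_1\cdots\sigma_n}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right], \tag{5.6-1}$$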


†Such diagonalizations occur in a branch of applied probability called pattern recognition. In particular, 
if one is trying to distinguish between two classes of data, it is easier to do so when the data are represented 
by diagonal covariance matrices. 




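where $\mu_i$, $\sigma_i^2$ are the mean and variance, respectively, of $X_i$, $i = 1,\ldots,n$. Equation 5.6-1
can be written compactly as

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right], \tag{5.6-2}$$

where

$$\mathbf{K}_{XX} \triangleq \begin{bmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{bmatrix}, \tag{5.6-3}$$

$\boldsymbol{\mu} = (\mu_1,\ldots,\mu_n)^T$, and $\det(\mathbf{K}_{XX}) = \prod_{i=1}^{n}\sigma_i^2$. Note that $\mathbf{K}_{XX}^{-1}$ is merely

$$\mathbf{K}_{XX}^{-1} = \begin{bmatrix} \sigma_1^{-2} & & 0 \\ & \ddots & \\ 0 & & \sigma_n^{-2} \end{bmatrix}.$$

Note that because the $X_i$, $i = 1,\ldots,n$, are independent, the covariance matrix $\mathbf{K}_{XX}$ is
diagonal, since

$$E[(X_i - \mu_i)^2] \triangleq \sigma_i^2, \quad i = 1,\ldots,n, \tag{5.6-4}$$

$$E[(X_i - \mu_i)(X_j - \mu_j)] = 0, \quad i \ne j. \tag{5.6-5}$$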

Next we ask, what happens if $\mathbf{K}_{XX}$ is a positive definite covariance matrix that is not
necessarily diagonal? Does Equation 5.6-2 with arbitrary p.d. covariance $\mathbf{K}_{XX}$ still obey
the requirements of a pdf? If it does, we shall call $\mathbf{X}$ a Normal random vector and $f_{\mathbf{X}}(\mathbf{x})$
the multidimensional Normal pdf. To show that $f_{\mathbf{X}}(\mathbf{x})$ is indeed a pdf, we must show that

$$f_{\mathbf{X}}(\mathbf{x}) \ge 0 \tag{5.6-6a}$$

and

$$\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = 1. \tag{5.6-6b}$$

(We use the vector notation $d\mathbf{x} \triangleq dx_1\,dx_2\cdots dx_n$ for a volume element.) We assume as
always that $\mathbf{X}$ is real; that is, $X_1,\ldots,X_n$ are real RVs. To show that Equation 5.6-2 with
arbitrary p.d. covariance matrix $\mathbf{K}_{XX}$ satisfies Equation 5.6-6a is simple and left as an
exercise; to prove Equation 5.6-6b is more difficult, and follows here.


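Proof of Equation 5.6-6b when $f_{\mathbf{X}}(\mathbf{x})$ is as in Equation 5.6-2 and $\mathbf{K}_{XX}$ is an
arbitrary p.d. covariance matrix. We note that with $\mathbf{z} \triangleq \mathbf{x} - \boldsymbol{\mu}$, Equation 5.6-2 can
be written as

$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\,\phi(\mathbf{z}),$$

where

$$\phi(\mathbf{z}) \triangleq \exp\left(-\tfrac{1}{2}\mathbf{z}^T\mathbf{K}_{XX}^{-1}\mathbf{z}\right). \tag{5.6-7a}$$

With

$$\alpha \triangleq \int_{-\infty}^{\infty}\phi(\mathbf{z})\,d\mathbf{z}, \tag{5.6-7b}$$

we see that

$$\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = \frac{\alpha}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}.$$

Hence we need only evaluate $\alpha$ to prove (or disprove) Equation 5.6-6b.

From the discussion on whitening transformation we know that there exists an $n \times n$
matrix $\mathbf{C}$ such that $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$ and $\mathbf{C}^T\mathbf{K}_{XX}^{-1}\mathbf{C} = \mathbf{I}$ (the identity matrix). Now consider
the linear transformation

$$\mathbf{z} = \mathbf{C}\mathbf{y} \tag{5.6-8}$$

for use in Equation 5.6-7a. To understand the effect of this transformation, we note first
that

$$\mathbf{z}^T\mathbf{K}_{XX}^{-1}\mathbf{z} = \mathbf{y}^T\mathbf{C}^T\mathbf{K}_{XX}^{-1}\mathbf{C}\mathbf{y} = \|\mathbf{y}\|^2 = \sum_{i=1}^{n} y_i^2$$

so that $\phi(\mathbf{z})$ is given by

$$\phi(\mathbf{z}) = \prod_{i=1}^{n}\exp\left[-\tfrac{1}{2}y_i^2\right].$$

Next we use a result from advanced calculus (see Kenneth Miller, [5-5, p. 16]) that for a
linear transformation such as in Equation 5.6-8 volume elements are related as

$$d\mathbf{z} = |\det(\mathbf{C})|\,d\mathbf{y},$$

where $d\mathbf{z} \triangleq dz_1\cdots dz_n$ and $d\mathbf{y} = dy_1\cdots dy_n$. Hence Equation 5.6-7b is transformed to

$$\begin{aligned}
\alpha &= \int_{-\infty}^{\infty}\exp\left(-\frac{1}{2}\sum_{i=1}^{n} y_i^2\right)dy_1\cdots dy_n\,|\det(\mathbf{C})| \\
&= \left[\int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2}\,dy\right]^n|\det(\mathbf{C})| \\
&= [2\pi]^{n/2}|\det(\mathbf{C})|.
\end{aligned}$$

But since $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$, $\det(\mathbf{K}_{XX}) = \det(\mathbf{C})\det(\mathbf{C}^T) = [\det(\mathbf{C})]^2$ or

$$|\det(\mathbf{C})| = |\det(\mathbf{K}_{XX})|^{1/2} = (\det(\mathbf{K}_{XX}))^{1/2}.$$

Hence

$$\alpha = (2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}$$

and

$$\frac{\alpha}{[2\pi]^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}} = 1,$$

which proves Equation 5.6-6b. ∎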


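Having established that

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \tag{5.6-9}$$

indeed satisfies the requirements of a pdf and is a generalization of the univariate Normal
pdf, we now ask what is the pdf of the random vector $\mathbf{Y}$ given by

$$\mathbf{Y} \triangleq \mathbf{A}\mathbf{X}, \tag{5.6-10}$$

where $\mathbf{A}$ is a nonsingular $n \times n$ transformation. The answer is furnished by the following
theorem.

Theorem 5.6-1 Let $\mathbf{X}$ be an n-dimensional Normal random vector with positive
definite covariance matrix $\mathbf{K}_{XX}$ and mean vector $\boldsymbol{\mu}$. Let $\mathbf{A}$ be a nonsingular linear
transformation in n dimensions. Then $\mathbf{Y} \triangleq \mathbf{A}\mathbf{X}$ is an n-dimensional Normal random vector with
covariance matrix $\mathbf{K}_{YY} \triangleq \mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T$ and mean vector $\boldsymbol{\beta} \triangleq \mathbf{A}\boldsymbol{\mu}$.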


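Proof We use Equation 5.2-11, that is,

$$f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i=1}^{r}\frac{f_{\mathbf{X}}(\mathbf{x}_i)}{|J_i|}, \tag{5.6-11}$$

where $\mathbf{Y}$ is some function of $\mathbf{X}$, that is, $\mathbf{Y} = \mathbf{g}(\mathbf{X}) \triangleq (g_1(\mathbf{X}),\ldots,g_n(\mathbf{X}))^T$, the $\mathbf{x}_i$, $i =
1,\ldots,r$, are the roots of the equation $\mathbf{g}(\mathbf{x}_i) - \mathbf{y} = \mathbf{0}$, and $J_i$ is the Jacobian evaluated at
the ith root, that is,

$$J_i = \det\left(\frac{\partial\mathbf{g}}{\partial\mathbf{x}}\right)\bigg|_{\mathbf{x} = \mathbf{x}_i}
= \begin{vmatrix} \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{vmatrix}_{\mathbf{x} = \mathbf{x}_i}. \tag{5.6-12}$$

Since we are dealing with a nonsingular linear transformation, the only solution to

$$\mathbf{A}\mathbf{x} - \mathbf{y} = \mathbf{0} \quad \text{is} \quad \mathbf{x} = \mathbf{A}^{-1}\mathbf{y}. \tag{5.6-13}$$

Also

$$J_i = \det\left(\frac{\partial(\mathbf{A}\mathbf{x})}{\partial\mathbf{x}}\right) = \det(\mathbf{A}). \tag{5.6-14}$$

Hence

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}|\det(\mathbf{A})|}\exp\left(-\tfrac{1}{2}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})\right). \tag{5.6-15}$$

Can this formidable expression be put in the form of Equation 5.6-9? First we note that

$$[\det(\mathbf{K}_{XX})]^{1/2}|\det(\mathbf{A})| = [\det(\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T)]^{1/2}. \tag{5.6-16}$$

Next, factoring $\mathbf{A}^{-1}$ out of the first and last factors, and combining these terms with
the inverse covariance matrix, we obtain

$$(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu}) = (\mathbf{y} - \mathbf{A}\boldsymbol{\mu})^T(\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T)^{-1}(\mathbf{y} - \mathbf{A}\boldsymbol{\mu}). \tag{5.6-17}$$

But $\mathbf{A}\boldsymbol{\mu} \triangleq \boldsymbol{\beta} = E[\mathbf{Y}]$ and $\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T = E[(\mathbf{Y} - \boldsymbol{\beta})(\mathbf{Y} - \boldsymbol{\beta})^T] \triangleq \mathbf{K}_{YY}$. Hence Equation 5.6-15
can be rewritten as

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{YY})]^{1/2}}\exp\left[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\beta})^T\mathbf{K}_{YY}^{-1}(\mathbf{y} - \boldsymbol{\beta})\right]. \tag{5.6-18}$$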


The next question that arises quite naturally as an extension of the previous result is: 
Does Y remain a Normal random vector under more general (nontrivial) linear trans- 
formation? The answer is given by the following theorem, which is a generalization of 
Theorem 5.6-1. 


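Theorem 5.6-2 Let $\mathbf{X}$ be an n-dimensional Normal random vector with positive
definite covariance matrix $\mathbf{K}_{XX}$ and mean vector $\boldsymbol{\mu}$. Let $\mathbf{A}_{mn}$ be an $m \times n$ matrix of
rank m. Then the random vector generated by

$$\mathbf{Y} = \mathbf{A}_{mn}\mathbf{X}$$

has an m-dimensional Normal pdf with p.d. covariance matrix $\mathbf{K}_{YY}$ and mean vector $\boldsymbol{\beta}$
given, respectively, by

$$\mathbf{K}_{YY} \triangleq \mathbf{A}_{mn}\mathbf{K}_{XX}\mathbf{A}_{mn}^T \tag{5.6-19}$$

and

$$\boldsymbol{\beta} = \mathbf{A}_{mn}\boldsymbol{\mu}. \tag{5.6-20}$$

The proof of this theorem is quite similar to the proof of Theorem 5.6-1; it is given by Miller
in [5-6, p. 22].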
Some examples involving transformations of Normal random variables are given below. 


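Example 5.6-1
(transforming to independence) A zero-mean Normal random vector $\mathbf{X} = (X_1,X_2)^T$ has
covariance matrix $\mathbf{K}_{XX}$ given by

$$\mathbf{K}_{XX} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}.$$

Find a transformation $\mathbf{Y} = \mathbf{C}\mathbf{X}$ such that $\mathbf{Y} = (Y_1, Y_2)^T$ is a Normal random vector with
uncorrelated (and therefore independent) components of unity variance.

Solution Write

$$E[\mathbf{Y}\mathbf{Y}^T] = E[\mathbf{C}\mathbf{X}\mathbf{X}^T\mathbf{C}^T] = \mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T = \mathbf{I}.$$

The last equality on the right follows from the requirement that the covariance of $\mathbf{Y}$, $\mathbf{K}_{YY}$,
satisfies

$$\mathbf{K}_{YY} = \mathbf{I}.$$

From the previous discussion on whitening, the matrix $\mathbf{C}$ must be $\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$, where
$\boldsymbol{\Lambda}^{-1/2}$ is the normalizing matrix

$$\boldsymbol{\Lambda}^{-1/2} \triangleq \begin{bmatrix} \lambda_1^{-1/2} & 0 \\ 0 & \lambda_2^{-1/2} \end{bmatrix} \quad (\lambda_i, i = 1,2 \text{ are eigenvalues of } \mathbf{K}_{XX})$$

and $\mathbf{U}$ is the matrix whose columns are the unit eigenvectors of $\mathbf{K}_{XX}$ (recall $\mathbf{U}^{-1} = \mathbf{U}^T$).
From $\det(\mathbf{K}_{XX} - \lambda\mathbf{I}) = 0$, we find $\lambda_1 = 4$, $\lambda_2 = 2$. Hence

$$\boldsymbol{\Lambda}^{-1/2} = \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{\sqrt{2}} \end{bmatrix}.$$

Next from

$$(\mathbf{K}_{XX} - \lambda_1\mathbf{I})\boldsymbol{\phi}_1 = \mathbf{0}, \quad \text{with } \|\boldsymbol{\phi}_1\| = 1,$$

and

$$(\mathbf{K}_{XX} - \lambda_2\mathbf{I})\boldsymbol{\phi}_2 = \mathbf{0}, \quad \text{with } \|\boldsymbol{\phi}_2\| = 1,$$

we find $\boldsymbol{\phi}_1 = (1/\sqrt{2}, -1/\sqrt{2})^T$, $\boldsymbol{\phi}_2 = (1/\sqrt{2}, 1/\sqrt{2})^T$. Thus,

$$\mathbf{U} = (\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2) = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

and

$$\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T = \frac{1}{\sqrt{2}}\begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}.$$

As a check to see if $\mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T$ is indeed an identity covariance matrix, we compute

$$\frac{1}{\sqrt{2}}\begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{\sqrt{2}} \end{bmatrix}\frac{1}{\sqrt{2}}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

In some situations we might want to generate correlated samples of a random vector $\mathbf{X}$ whose
covariance matrix $\mathbf{K}_{XX}$ is not diagonal. From Example 5.6-1 we see that the transformation

$$\mathbf{X} = \mathbf{C}^{-1}\mathbf{Y}, \tag{5.6-21}$$

where $\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$, produces a Normal random vector whose covariance is $\mathbf{K}_{XX}$. Thus, one
way of obtaining correlated from uncorrelated samples is to use the transformation given in
Equation 5.6-21 on jointly independent computer-generated samples. This procedure is the
reverse of what we did in Example 5.6-1.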
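
The procedure of Equation 5.6-21 can be sketched for the covariance matrix of Example 5.6-1. With $\lambda_1 = 4$, $\lambda_2 = 2$ and the eigenvectors found there, $\mathbf{C}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$ works out to the matrix hard-coded below. The sample size, seed, and function names are arbitrary choices made for illustration.

```python
import math
import random

# C^{-1} = U Lambda^{1/2} for K_XX = [[3, -1], [-1, 3]] (eigenvalues 4 and 2).
C_INV = [[math.sqrt(2.0), 1.0],
         [-math.sqrt(2.0), 1.0]]


def correlated_samples(num: int, seed: int = 0):
    """Draw `num` samples of X = C^{-1} Y, where Y has independent N(0, 1) components."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num):
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((C_INV[0][0] * y1 + C_INV[0][1] * y2,
                        C_INV[1][0] * y1 + C_INV[1][1] * y2))
    return samples


def sample_covariance(samples):
    """Return (k11, k12, k22) of the 2-D sample covariance matrix."""
    n = len(samples)
    m1 = sum(x1 for x1, _ in samples) / n
    m2 = sum(x2 for _, x2 in samples) / n
    k11 = sum((x1 - m1) ** 2 for x1, _ in samples) / n
    k22 = sum((x2 - m2) ** 2 for _, x2 in samples) / n
    k12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples) / n
    return k11, k12, k22


k11, k12, k22 = sample_covariance(correlated_samples(50_000))
print(k11, k12, k22)  # expected to be close to 3, -1, 3
```

The sample covariance only approximates $\mathbf{K}_{XX}$; the agreement improves as the number of samples grows.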


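Example 5.6-2
(correlated Normal RVs) Jointly Normal RVs $X_1$ and $X_2$ have joint pdf given by (see
Equation 4.3-27 and the surrounding discussion in Section 4.3)

$$f_{X_1X_2}(x_1,x_2) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}}\exp\left(\frac{-1}{2\sigma^2(1 - \rho^2)}\left(x_1^2 - 2\rho x_1x_2 + x_2^2\right)\right).$$

Let the correlation coefficient $\rho$ be $-0.5$. From $X_1$, $X_2$ find two jointly Normal RVs $Y_1$ and
$Y_2$ such that $Y_1$ and $Y_2$ are independent. Avoid the trivial case of $Y_1 = Y_2 = 0$.

Solution Define $\mathbf{x} \triangleq (x_1, x_2)^T$ and $\mathbf{y} \triangleq (y_1, y_2)^T$. Then with $\rho = -0.5$, the quadratic in
the exponent can be written as

$$x_1^2 + x_1x_2 + x_2^2 = \mathbf{x}^T\begin{bmatrix} a & b \\ c & d \end{bmatrix}\mathbf{x} = ax_1^2 + (b + c)x_1x_2 + dx_2^2,$$

where the a, b, c, d are to be determined. We immediately find that $a = d = 1$ and, because
of the real symmetric requirement, we find $b = c = 0.5$. We can rewrite $f_{X_1X_2}(x_1, x_2)$ in
standard form as

$$f_{X_1X_2}(x_1, x_2) = \frac{1}{2\pi[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{K}_{XX}^{-1}\mathbf{x}\right),$$

whence

$$\mathbf{K}_{XX}^{-1} = \frac{1}{\sigma^2(1 - \rho^2)}\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \frac{4}{3\sigma^2}\begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}.$$

Our task is now to find a transformation that diagonalizes $\mathbf{K}_{XX}^{-1}$. This will enable the joint
pdf of $Y_1$ and $Y_2$ to be factored, thereby establishing that $Y_1$ and $Y_2$ are independent.

The factor $4/3\sigma^2$ affects the eigenvalues of $\mathbf{K}_{XX}^{-1}$ but not the eigenvectors. To compute
a set of orthonormal eigenvectors of $\mathbf{K}_{XX}^{-1}$ we need only consider $\tilde{\mathbf{K}}_{XX}^{-1}$ given by

$$\tilde{\mathbf{K}}_{XX}^{-1} \triangleq \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix},$$

for which we obtain $\lambda_1 = 3/2$, $\lambda_2 = 1/2$. The corresponding unit eigenvectors are $\boldsymbol{\phi}_1 =
(1/\sqrt{2})(1, 1)^T$ and $\boldsymbol{\phi}_2 = (1/\sqrt{2})(1, -1)^T$. Thus with

$$\tilde{\mathbf{U}} \triangleq \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

(normalization by $1/\sqrt{2}$ is not needed to obtain a diagonal covariance matrix so we dispense
with these factors) we find that

$$\tilde{\mathbf{U}}^T\tilde{\mathbf{K}}_{XX}^{-1}\tilde{\mathbf{U}} = \mathrm{diag}(3, 1).$$

Hence a transformation that will work is†

$$\mathbf{Y} = \tilde{\mathbf{U}}^T\mathbf{X},$$

that is,

$$\begin{aligned} Y_1 &= X_1 + X_2 \\ Y_2 &= X_1 - X_2. \end{aligned}$$

To find $f_{Y_1Y_2}(y_1, y_2)$ we use Equation 3.4-21 of Chapter 3:

$$f_{Y_1Y_2}(y_1, y_2) = \sum_{i=1}^{n} f_{X_1X_2}(\mathbf{x}_i)/|J_i|,$$

where the $\mathbf{x}_i \triangleq (x_1^{(i)}, x_2^{(i)})^T$, $i = 1,\ldots,n$, are the n solutions to $\mathbf{y} - \tilde{\mathbf{U}}^T\mathbf{x} = \mathbf{0}$ and $J_i$ is the
Jacobian. There is only one solution ($n = 1$) to $\mathbf{y} - \tilde{\mathbf{U}}^T\mathbf{x} = \mathbf{0}$, which is

$$x_1 = \frac{y_1 + y_2}{2}, \qquad x_2 = \frac{y_1 - y_2}{2},$$

and, dispensing with subscripts there being only one root,

$$J = \det\left(\frac{\partial\mathbf{g}}{\partial\mathbf{x}}\right) = \det\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = -2.$$

Hence

$$\begin{aligned}
f_{Y_1Y_2}(y_1, y_2) &= \frac{1}{2}f_{X_1X_2}\left(\frac{y_1 + y_2}{2}, \frac{y_1 - y_2}{2}\right) \\
&= \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{y_1^2}{2\sigma^2}\right)\right]\left[\frac{1}{\sqrt{2\pi\sigma'^2}}\exp\left(-\frac{y_2^2}{2\sigma'^2}\right)\right],
\end{aligned}$$

where $\sigma' \triangleq \sqrt{3}\,\sigma$.

†There is no requirement to whiten the covariance matrix as in Example 5.6-1. Also, diagonalizing $\mathbf{K}_{XX}^{-1}$
is equivalent to diagonalizing $\mathbf{K}_{XX}$.

Examples 5.6-1 and 5.6-2 are special cases of the following theorem:

Theorem 5.6-3 Let $\mathbf{X}$ be a Normal, zero-mean (for convenience) random vector with
positive definite covariance matrix $\mathbf{K}_{XX}$. Then there exists a nonsingular $n \times n$ matrix $\mathbf{C}$
such that under the transformation

$$\mathbf{Y} = \mathbf{C}^{-1}\mathbf{X},$$

the components $Y_1,\ldots,Y_n$ of $\mathbf{Y}$ are independent and of unit variance.

Proof Let $\mathbf{C}^{-1} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$;
then $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$, so that $\mathbf{K}_{YY} = \mathbf{C}^{-1}\mathbf{K}_{XX}(\mathbf{C}^{-1})^T = \mathbf{I}$, and uncorrelated jointly Normal components are independent. ∎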


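Example 5.6-3
(generalized Rayleigh law) Let $\mathbf{X} = (X_1, X_2, X_3)^T$ be a Normal random vector with covariance
matrix

$$\mathbf{K}_{XX} = \sigma^2\mathbf{I}.$$

Compute the pdf of $R_3 \triangleq \|\mathbf{X}\| = \sqrt{X_1^2 + X_2^2 + X_3^2}$.

Solution The probability of the event $\{R_3 \le r\}$ is the CDF $F_{R_3}(r)$ of $R_3$. Thus,

$$F_{R_3}(r) = \frac{1}{(2\pi)^{3/2}\sigma^3}\iiint_{\mathcal{G}}\exp\left[-\frac{1}{2\sigma^2}\left(x_1^2 + x_2^2 + x_3^2\right)\right]dx_1\,dx_2\,dx_3,$$

where $\mathcal{G} \triangleq \{(x_1,x_2,x_3): \sqrt{x_1^2 + x_2^2 + x_3^2} \le r\}$. Now let

$$\begin{aligned}
x_1 &\triangleq \xi\cos\phi \\
x_2 &\triangleq \xi\sin\phi\cos\theta \\
x_3 &\triangleq \xi\sin\phi\sin\theta,
\end{aligned}$$

that is, a rectangular-to-spherical coordinate transformation. The Jacobian of this
transformation is $\xi^2\sin\phi$. Using this transformation in the expression for $F_{R_3}(r)$, we obtain
for $r > 0$

$$\begin{aligned}
F_{R_3}(r) &= \frac{1}{(2\pi)^{3/2}\sigma^3}\int_0^{2\pi}d\theta\int_0^{\pi}\sin\phi\,d\phi\int_0^{r}\xi^2\exp\left[-\frac{\xi^2}{2\sigma^2}\right]d\xi \\
&= \frac{4\pi}{(2\pi)^{3/2}\sigma^3}\int_0^{r}\xi^2\exp\left[-\frac{\xi^2}{2\sigma^2}\right]d\xi.
\end{aligned}$$

To obtain $f_{R_3}(r)$, we differentiate $F_{R_3}(r)$ with respect to r. This yields

$$f_{R_3}(r) = \frac{2r^2}{\Gamma\left(\frac{3}{2}\right)[2\sigma^2]^{3/2}}\exp\left[-\frac{r^2}{2\sigma^2}\right]\cdot u(r), \tag{5.6-22}$$

where $u(r)$ is the unit step and $\Gamma(3/2) = \sqrt{\pi}/2$. Equation 5.6-22 is an extension of the
ordinary two-dimensional Rayleigh introduced in Chapter 2. The general n-dimensional
Rayleigh is the pdf associated with $R_n \triangleq \|\mathbf{X}\| = \sqrt{X_1^2 + \ldots + X_n^2}$ and is given by

$$f_{R_n}(r) = \frac{2r^{n-1}}{\Gamma\left(\frac{n}{2}\right)[2\sigma^2]^{n/2}}\exp\left[-\frac{r^2}{2\sigma^2}\right]\cdot u(r). \tag{5.6-23}$$

The proof of Equation 5.6-23 requires the use of n-dimensional spherical coordinates. Such
generalized spherical coordinates are well known in the mathematical literature [5-5, p. 9].
The demonstration of Equation 5.6-23 is left as a challenging problem.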
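
As a plausibility check on Equation 5.6-23, the sketch below evaluates the n-dimensional Rayleigh pdf and integrates it numerically with the trapezoidal rule. The function names, the integration range of $12\sigma$, and the step count are illustrative choices, not part of the text.

```python
import math


def rayleigh_n_pdf(r: float, n: int, sigma: float) -> float:
    """n-dimensional Rayleigh pdf of Equation 5.6-23 (zero for r < 0)."""
    if r < 0.0:
        return 0.0
    coeff = 2.0 / (math.gamma(n / 2.0) * (2.0 * sigma ** 2) ** (n / 2.0))
    return coeff * r ** (n - 1) * math.exp(-r ** 2 / (2.0 * sigma ** 2))


def trapezoid(f, a: float, b: float, steps: int = 20_000) -> float:
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, steps))
    return total * h


sigma = 1.5
areas = {n: trapezoid(lambda r, n=n: rayleigh_n_pdf(r, n, sigma), 0.0, 12.0 * sigma)
         for n in (2, 3, 5)}
print(areas)
```

Each area should be very close to 1. For n = 2 the formula reduces to the ordinary Rayleigh pdf $(r/\sigma^2)\exp(-r^2/2\sigma^2)$, and for n = 3 it reduces to Equation 5.6-22.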


5.7 CHARACTERISTIC FUNCTIONS OF RANDOM VECTORS 
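In Equation 4.7-1 we defined the CF of a random variable as

$$\Phi_X(\omega) \triangleq E[e^{j\omega X}].$$

The extension to random vectors is straightforward. Let $\mathbf{X} = (X_1,\ldots,X_n)^T$ be a real
n-component random vector. Let $\boldsymbol{\omega} = (\omega_1,\ldots,\omega_n)^T$ be a real n-component parameter vector.
The CF of $\mathbf{X}$ is defined as

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) \triangleq E[e^{j\boldsymbol{\omega}^T\mathbf{X}}]. \tag{5.7-1}$$

The similarity to the scalar case is obvious. In the case of continuous random vectors, the
actual evaluation of Equation 5.7-1 is done through

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})e^{j\boldsymbol{\omega}^T\mathbf{x}}\,d\mathbf{x}. \tag{5.7-2}$$

In Equation 5.7-2 we use the usual compact notation that $d\mathbf{x} = dx_1\cdots dx_n$ and the integral
sign refers to an n-fold integration. If $\mathbf{X}$ is a discrete random vector, $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ can be computed
from the joint PMF as

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \sum_{-\infty}^{+\infty} P_{\mathbf{X}}(\mathbf{x})e^{j\boldsymbol{\omega}^T\mathbf{x}}, \tag{5.7-3}$$

where the summation sign refers to an n-fold summation.

In both cases, we see that $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ is, except for a sign reversal in the exponent, the
n-dimensional Fourier transform of $f_{\mathbf{X}}(\mathbf{x})$ or $P_{\mathbf{X}}(\mathbf{x})$. This being the case, we can recover for
example the pdf by the inverse n-dimensional Fourier transform (again with a sign reversal).
Thus,

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^n}\int_{-\infty}^{\infty}\Phi_{\mathbf{X}}(\boldsymbol{\omega})e^{-j\boldsymbol{\omega}^T\mathbf{x}}\,d\boldsymbol{\omega}. \tag{5.7-4}$$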
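The CF is very useful for computing joint moments. We illustrate with an example.

Example 5.7-1
(finding mixed moment) Let $\mathbf{X} \triangleq (X_1, X_2, X_3)^T$ and $\boldsymbol{\omega} \triangleq (\omega_1,\omega_2,\omega_3)^T$. Compute
$E[X_1X_2X_3]$.

Solution Since

$$\Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{\mathbf{X}}(x_1, x_2, x_3)\,e^{j[\omega_1x_1 + \omega_2x_2 + \omega_3x_3]}\,dx_1\,dx_2\,dx_3,$$

we obtain by partial differentiation

$$\begin{aligned}
\frac{1}{j^3}\frac{\partial^3\Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3)}{\partial\omega_1\,\partial\omega_2\,\partial\omega_3}\bigg|_{\omega_1 = \omega_2 = \omega_3 = 0}
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x_1x_2x_3\,f_{\mathbf{X}}(x_1, x_2, x_3)\,dx_1\,dx_2\,dx_3 \\
&\triangleq E[X_1X_2X_3].
\end{aligned}$$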

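Any moment, provided that it exists, can be computed by the method used in
Example 5.7-1, that is, by partial differentiation. Thus,

$$E[X_1^{k_1}\cdots X_n^{k_n}] = j^{-(k_1 + \ldots + k_n)}\frac{\partial^{k_1 + \ldots + k_n}\Phi_{\mathbf{X}}(\omega_1,\ldots,\omega_n)}{\partial\omega_1^{k_1}\cdots\partial\omega_n^{k_n}}\bigg|_{\omega_1 = \ldots = \omega_n = 0}. \tag{5.7-5}$$

By writing

$$E[\exp(j\boldsymbol{\omega}^T\mathbf{X})] = E\left[\exp\left(j\sum_{i=1}^{n}\omega_iX_i\right)\right] = E\left[\prod_{i=1}^{n}\exp(j\omega_iX_i)\right]$$

and expanding each term in the product into a power series, we readily obtain the rather
cumbersome formula

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \sum_{k_1=0}^{\infty}\cdots\sum_{k_n=0}^{\infty} E[X_1^{k_1}\cdots X_n^{k_n}]\frac{(j\omega_1)^{k_1}}{k_1!}\cdots\frac{(j\omega_n)^{k_n}}{k_n!}, \tag{5.7-6}$$

which has the advantage of explicitly revealing the relationship between the joint CF and
the joint moments of the $X_i$, $i = 1,\ldots,n$. Of course Equation 5.7-6 has meaning only if

$$E[X_1^{k_1}\cdots X_n^{k_n}]$$

exists for all values of the nonnegative integers $k_1,\ldots,k_n$, and when the power series
converge.

From Equation 5.7-2 observe the important CF properties: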


Properties of CF of Random Vectors 


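1. $|\Phi_{\mathbf{X}}(\boldsymbol{\omega})| \le \Phi_{\mathbf{X}}(\mathbf{0}) = 1$ and
2. $\Phi_{\mathbf{X}}^*(\boldsymbol{\omega}) = \Phi_{\mathbf{X}}(-\boldsymbol{\omega})$ ($*$ indicates conjugation).
3. All CFs of subsets of the components of $\mathbf{X}$ can be obtained once $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ is known.

The last property is readily demonstrated with the following example. Suppose
$\mathbf{X} = (X_1, X_2, X_3)^T$ has CF† $\Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3) = E[\exp j(\omega_1X_1 + \omega_2X_2 + \omega_3X_3)]$. Then

$$\begin{aligned}
\Phi_{X_1X_2}(\omega_1, \omega_2) &= \Phi_{X_1X_2X_3}(\omega_1, \omega_2, 0) \\
\Phi_{X_1X_3}(\omega_1, \omega_3) &= \Phi_{X_1X_2X_3}(\omega_1, 0, \omega_3) \\
\Phi_{X_1}(\omega_1) &= \Phi_{X_1X_2X_3}(\omega_1, 0, 0).
\end{aligned}$$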

As pointed out in Chapter 4, CFs are also useful in solving problems involving sums of
independent RVs. Thus, suppose $\mathbf{X} = (X_1,\ldots,X_n)^T$, where the $X_i$ are independent RVs
with marginal pdf's $f_{X_i}(x_i)$, $i = 1,\ldots,n$. The pdf of the sum

$$Z = X_1 + \ldots + X_n$$

can be obtained from

$$f_Z(z) = f_{X_1}(z) * \ldots * f_{X_n}(z). \tag{5.7-7}$$

However, the actual carrying out of the n-fold convolution in Equation 5.7-7 can be quite
tedious. The computation of $f_Z(z)$ can be done more advantageously using CFs as follows.
We have

$$\begin{aligned}
\Phi_Z(\omega) &= E[e^{j\omega(X_1 + \ldots + X_n)}] \\
&= \prod_{i=1}^{n} E[e^{j\omega X_i}] \\
&= \prod_{i=1}^{n}\Phi_{X_i}(\omega).
\end{aligned} \tag{5.7-8}$$

In this development, line 2 follows from the fact that if $X_1,\ldots,X_n$ are n independent
RVs, then $Y_i = g_i(X_i)$, $i = 1,\ldots,n$, will also be n independent RVs and $E[Y_1\cdots Y_n] =
E[Y_1]\cdots E[Y_n]$. The inverse Fourier transform of Equation 5.7-8 yields the pdf $f_Z(z)$. This
approach works equally well when the $X_i$ are discrete. Then the PMF and the discrete
Fourier transform can be used. We illustrate this approach to computing the pdf's of sums
of RVs with an example.


†We use $\Phi_{\mathbf{X}}(\cdot)$ and $\Phi_{X_1X_2X_3}(\cdot)$ interchangeably if $\mathbf{X} = (X_1, X_2, X_3)^T$.

Example 5.7-2
(i.i.d. Poisson CF) Let $\mathbf{X} = (X_1,\ldots,X_n)^T$, where the $X_i$, $i = 1,\ldots,n$, are i.i.d. Poisson
RVs with Poisson parameter $\lambda$. Let $Z = X_1 + \ldots + X_n$. Then the individual PMFs are

$$P_{X_i}(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, \ldots, \tag{5.7-9}$$

and

$$\Phi_{X_i}(\omega) = \sum_{k=0}^{\infty}\frac{\lambda^k e^{-\lambda}\exp(j\omega k)}{k!} = e^{\lambda(\exp(j\omega) - 1)}. \tag{5.7-10}$$

Hence, by independence we obtain

$$\Phi_Z(\omega) = \prod_{i=1}^{n} e^{\lambda(\exp(j\omega) - 1)} = e^{n\lambda(\exp(j\omega) - 1)}. \tag{5.7-11}$$

Comparing Equation 5.7-11 with Equation 5.7-10 we see by inspection that $\Phi_Z(\omega)$ is the
CF of the PMF

$$P_Z(k) = \frac{\alpha^k e^{-\alpha}}{k!}, \quad k = 0, 1, \ldots, \tag{5.7-12}$$

where $\alpha \triangleq n\lambda$. Thus, the sum of n i.i.d. Poisson RVs is Poisson with parameter $n\lambda$.
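
The conclusion of Example 5.7-2 can be cross-checked without CFs by carrying out the n-fold discrete convolution of Equation 5.7-7 directly and comparing it with the Poisson PMF of parameter $n\lambda$. The parameter values and function names below are illustrative. Because the k-th term of a convolution of PMFs on the nonnegative integers involves only indices up to k, truncating at `kmax` introduces no error for $k \le$ `kmax`.

```python
import math


def poisson_pmf(lam: float, kmax: int):
    """Poisson PMF values P(k) for k = 0, ..., kmax."""
    return [lam ** k * math.exp(-lam) / math.factorial(k) for k in range(kmax + 1)]


def convolve(p, q, kmax: int):
    """Discrete convolution of two PMFs on {0, 1, ...}, kept for k = 0, ..., kmax."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(kmax + 1)]


def pmf_of_sum(lam: float, n: int, kmax: int):
    """PMF of the sum of n i.i.d. Poisson(lam) RVs via repeated convolution."""
    single = poisson_pmf(lam, kmax)
    total = single
    for _ in range(n - 1):
        total = convolve(total, single, kmax)
    return total


lam, n, kmax = 0.7, 4, 20
by_convolution = pmf_of_sum(lam, n, kmax)
direct = poisson_pmf(n * lam, kmax)
max_gap = max(abs(a - b) for a, b in zip(by_convolution, direct))
print(max_gap)
```

The largest discrepancy between the two PMFs is at the level of floating-point rounding, consistent with Equation 5.7-12.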


The Characteristic Function of the Gaussian (Normal) Law 


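Let $\mathbf{X}$ be a real Gaussian (Normal) random vector with nonsingular covariance matrix $\mathbf{K}_{XX}$.
Then from Theorem 5.6-3 both $\mathbf{K}_{XX}$ and $\mathbf{K}_{XX}^{-1}$ can be factored as

$$\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T \tag{5.7-13}$$

$$\mathbf{K}_{XX}^{-1} = \mathbf{D}\mathbf{D}^T, \quad \mathbf{D} \triangleq [\mathbf{C}^T]^{-1}, \tag{5.7-14}$$

where $\mathbf{C}$ and $\mathbf{D}$ are nonsingular. This observation will be put to good use shortly. The CF
of $\mathbf{X}$ is by definition

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}} \int_{-\infty}^{\infty} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\exp(j\boldsymbol{\omega}^T\mathbf{x})\,d\mathbf{x}. \tag{5.7-15}$$

Now introduce the transformation

$$\mathbf{z} \triangleq \mathbf{D}^T(\mathbf{x}-\boldsymbol{\mu}) \tag{5.7-16}$$

so that

$$\mathbf{z}^T\mathbf{z} = (\mathbf{x}-\boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x}-\boldsymbol{\mu}). \tag{5.7-17}$$

The Jacobian of this transformation is $\det(\mathbf{D}^T) = \det(\mathbf{D})$. Thus under the transformation
in Equation 5.7-16, Equation 5.7-15 becomes

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \frac{\exp(j\boldsymbol{\omega}^T\boldsymbol{\mu})}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}\,|\det(\mathbf{D})|} \int_{-\infty}^{\infty} \exp\left(-\tfrac{1}{2}\mathbf{z}^T\mathbf{z}\right)\exp\left(j\boldsymbol{\omega}^T(\mathbf{D}^T)^{-1}\mathbf{z}\right)d\mathbf{z}. \tag{5.7-18}$$

We can complete the squares in the integrand as follows:

$$\exp\left[-\tfrac{1}{2}\left(\mathbf{z}^T\mathbf{z} - 2j\boldsymbol{\omega}^T(\mathbf{D}^T)^{-1}\mathbf{z}\right)\right] = \exp\left(-\tfrac{1}{2}\boldsymbol{\omega}^T(\mathbf{D}^T)^{-1}\mathbf{D}^{-1}\boldsymbol{\omega}\right)\cdot\exp\left(-\tfrac{1}{2}\|\mathbf{z} - j\mathbf{D}^{-1}\boldsymbol{\omega}\|^2\right). \tag{5.7-19}$$

Equations 5.7-18 and 5.7-19 will be greatly simplified if we use the following results: (a) If
$\mathbf{K}_{XX}^{-1} = \mathbf{D}\mathbf{D}^T$, then $\mathbf{K}_{XX} = [\mathbf{D}^T]^{-1}\mathbf{D}^{-1}$; (b) $\det(\mathbf{K}_{XX}^{-1}) = \det(\mathbf{D})\det(\mathbf{D}^T) = [\det(\mathbf{D})]^2 = [\det(\mathbf{K}_{XX})]^{-1}$.
Hence $|\det(\mathbf{D})|^{-1} = [\det(\mathbf{K}_{XX})]^{1/2}$. It then follows that

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left(j\boldsymbol{\omega}^T\boldsymbol{\mu} - \tfrac{1}{2}\boldsymbol{\omega}^T\mathbf{K}_{XX}\boldsymbol{\omega}\right)\frac{1}{(2\pi)^{n/2}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}\|\mathbf{z} - j\mathbf{D}^{-1}\boldsymbol{\omega}\|^2}\,d\mathbf{z}.$$

Finally we recognize that the n-fold integral on the right-hand side is the product of n identical
integrals of one-dimensional Gaussian densities, each of unit variance. Hence the value
of the integral is merely $(2\pi)^{n/2}$, which cancels the factor $(2\pi)^{-n/2}$ and yields the CF for
the Normal random vector:

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left[j\boldsymbol{\omega}^T\boldsymbol{\mu} - \tfrac{1}{2}\boldsymbol{\omega}^T\mathbf{K}_{XX}\boldsymbol{\omega}\right], \tag{5.7-20}$$

where $\boldsymbol{\mu}$ is the mean vector, $\boldsymbol{\omega} = (\omega_1,\ldots,\omega_n)^T$, and $\mathbf{K}_{XX}$ is the covariance. We observe in
passing that $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ has a multidimensional complex Gaussian form as a function of $\boldsymbol{\omega}$. Thus,
the Gaussian pdf has mapped into a Gaussian CF, a result that should not be too surprising
since we already know that the one-dimensional Fourier transform maps a Gaussian function
into a Gaussian function.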
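The closed form for the CF of the Normal random vector can be compared against a sample average of $\exp(j\boldsymbol{\omega}^T\mathbf{X})$. The following is a rough Monte Carlo sketch for the two-dimensional case only; the mean, covariance, frequency vector, sample count, seed, and all function names are assumptions made for illustration. It factors the covariance as $\mathbf{C}\mathbf{C}^T$ with a hand-written 2 x 2 Cholesky step to generate the correlated samples.

```python
import cmath
import math
import random


def gaussian_cf(omega, mu, K):
    """Closed form exp(j w^T mu - 0.5 w^T K w)."""
    n = len(mu)
    lin = sum(omega[i] * mu[i] for i in range(n))
    quad = sum(omega[i] * K[i][j] * omega[j] for i in range(n) for j in range(n))
    return cmath.exp(1j * lin - 0.5 * quad)


def cholesky_2x2(K):
    """Lower-triangular C with K = C C^T (2 x 2 case only)."""
    c11 = math.sqrt(K[0][0])
    c21 = K[1][0] / c11
    c22 = math.sqrt(K[1][1] - c21 * c21)
    return [[c11, 0.0], [c21, c22]]


def empirical_cf(omega, mu, K, samples=50_000, seed=1):
    """Monte Carlo estimate of E[exp(j w^T X)] for X Normal with mean mu, covariance K."""
    rng = random.Random(seed)
    C = cholesky_2x2(K)
    acc = 0j
    for _ in range(samples):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu[0] + C[0][0] * g1                 # x = mu + C g
        x2 = mu[1] + C[1][0] * g1 + C[1][1] * g2
        acc += cmath.exp(1j * (omega[0] * x1 + omega[1] * x2))
    return acc / samples


mu = [1.0, -0.5]                 # assumed mean vector
K = [[2.0, 0.6], [0.6, 1.0]]     # assumed covariance matrix (positive definite)
omega = [0.4, -0.3]              # assumed frequency vector
print(gaussian_cf(omega, mu, K), empirical_cf(omega, mu, K))
```

The two printed complex numbers should agree to within the sampling error, which shrinks roughly as one over the square root of the number of samples.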

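Similarly the joint MGF for a random vector $\mathbf{X} = (X_1,\ldots,X_n)^T$ is defined as

$$M_{\mathbf{X}}(\mathbf{t}) \triangleq E[\exp(\mathbf{t}^T\mathbf{X})] = \sum_{k_1=0}^{\infty}\sum_{k_2=0}^{\infty}\cdots\sum_{k_n=0}^{\infty} \frac{t_1^{k_1}t_2^{k_2}\cdots t_n^{k_n}}{k_1!\,k_2!\cdots k_n!}\,E[X_1^{k_1}X_2^{k_2}\cdots X_n^{k_n}],$$

from which joint moments can be computed analogously to the CF case.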


SUMMARY 


In this chapter we studied the calculus of multiple RVs. We found it convenient to organize 
multiple RVs into random vectors and treat these as single entities. We found that when i.i.d.
random variables are ordered, many probabilistic results can be derived without specifying
the underlying distributions. In Section 5.3, we derived, among others, the distribution of 
probability area (the area under the pdf between order samples) and the moments of such 
probability areas. We shall see in subsequent chapters that ordered random variables play 
important roles in a branch of statistics called distribution-free or robust statistics. Because 
in practice it is often difficult to describe the joint probability law of n RVs, we argued 
that in the case of random vectors we often settle for a less complete but more available 
characterization than that furnished by the pdf (PMF). We focused on the characterizations 
furnished by the lower order moments, especially the mean and covariance. In particular, 
because of the great importance of covariance matrices in signal processing, communication 
theory, pattern recognition, multiple regression analysis, and other areas of engineering and 
science, we made use of numerous results from matrix theory and linear algebra to reveal 
the properties of these matrices. 

We discussed the multidimensional Gaussian (Normal) law and CFs of random vectors. 
We demonstrated that under linear transformations Gaussian random vectors map into 
Gaussian random vectors. We showed how to derive a transformation that can convert 
correlated RVs into uncorrelated ones. The CF of random vectors in general was defined 
and shown to be useful in computing moments and solving problems involving the sums of 
independent RVs; these assertions were illustrated with examples. Finally, using vector and 
matrix techniques we derived the CF for the Gaussian random vector and showed that it 
too had a Gaussian shape. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 
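5.1 Let $f_{\mathbf{X}}(\mathbf{x})$ be given as
$$f_{\mathbf{X}}(\mathbf{x}) = K e^{-\boldsymbol{\lambda}^T\mathbf{x}}\,u(\mathbf{x}),$$
where $\boldsymbol{\lambda} = (\lambda_1,\ldots,\lambda_n)^T$ with $\lambda_i > 0$ for all i, $\mathbf{x} = (x_1,\ldots,x_n)^T$, $u(\mathbf{x}) = 1$ if $x_i \ge 0$,
i = 1,...,n, and zero otherwise, and K is a constant to be determined. What value
of K will enable $f_{\mathbf{X}}(\mathbf{x})$ to be a pdf?


5.2 Let $B_i$, i = 1,...,n, be n disjoint and exhaustive events. Show that the CDF of $\mathbf{X}$
can be written as

$$F_{\mathbf{X}}(\mathbf{x}) = \sum_{i=1}^{n} F_{\mathbf{X}|B_i}(\mathbf{x}|B_i)\,P[B_i].$$

5.3 For $-\infty < x_i < \infty$, i = 1, 2, ..., n, let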


fx(x) = Paes ee on p (=)) } 


Show that all the marginal pdf’s are Gaussian.

5.4 Let $X_1, X_2, X_3$ be three standard Normal RVs. For i = 1, 2, 3 let $Y_i \in \{X_1, X_2, X_3\}$
such that $Y_1 < Y_2 < Y_3$, i.e., the ordered-by-signed-magnitude of the $X_i$. Compute
the joint pdf $f_{Y_1Y_2Y_3}(y_1, y_2, y_3)$.

5.5 In Problem 5.4 compute the CDF $F_{Y_i}(y)$, for i = 1, 2, 3 and plot the result.

5.6 In Section 5.4 we introduced the RVs $Z_1$ and $Z_n$. Show that the joint pdf of $Z_1$ and
$Z_n$ is given by Equation 5.3-7.

5.7 Consider the RVs $V_{1n} \triangleq Z_n - Z_1$, $W = Z_n$. Show that the joint pdf $f_{V_{1n}W}(v, w) = n(n-1)v^{n-2}$,
for $0 < w - v < w < 1$ and zero else.

5.8 From the results of the previous problem, show that $f_{V_{1n}}(v) = n(n-1)v^{n-2}(1-v)$,
for $0 < v < 1$, $n \ge 2$ and zero else.

5.9 Show that the area under $f_{Z_1Z_2Z_3}(z_1, z_2, z_3) = 3!$ with $0 < z_1 < z_2 < z_3 < 1$ is unity.

5.10 Compute the beta CDF for n = 2, 8=0;n=2, 6=0.

5.11 Derive Equations 5.3-11, 5.3-12, 5.3-13.

5.12 Use Excel or a similar computer program to generate curves of the beta CDF for
n = 15, 20, 30. Describe what seems to be happening as $n \to \infty$.

5.13 Derive Equation 5.3-14.

5.14 Show that, on the average, n ordered random variables divide the total area under
$f_X(x)$ into n + 1 equal parts.

5.15 Show that any matrix $\mathbf{M}$ generated by an outer product of two vectors, that is,
$\mathbf{M} = \mathbf{X}\mathbf{X}^T$, has rank at most unity. Explain why $\mathbf{R} \triangleq E[\mathbf{X}\mathbf{X}^T]$ can be of full rank.

5.16 Let $\{X_i, i = 1,\ldots,n\}$ be n i.i.d. observations on X and let $\{Y_i, i = 1,\ldots,n\}$ be the
associated order statistics. Show that $F_{Y_n}(y) = F_X^n(y)$.

5.17 Let $\{X_i, i = 1,\ldots,n\}$ be n i.i.d. observations on X and let $\{Y_i, i = 1,\ldots,n\}$ be the
associated order statistics. Show that $F_{Y_1}(y) = 1 - (1 - F_X(y))^n$.

5.18 Let $\{X_i, i = 1,\ldots,n\}$ be n i.i.d. observations on X and let $\{Y_i, i = 1,\ldots,n\}$ be the
associated order statistics. Show that

$$F_{Y_r}(y) = \sum_{i=r}^{n}\binom{n}{i}F_X^i(y)[1 - F_X(y)]^{n-i}.$$

5.19 Show that the two RVs $X_1$ and $X_2$ with joint pdf

$$f_{X_1X_2}(x_1, x_2) = \begin{cases} \frac{1}{16}, & |x_1| < 4,\ 2 < x_2 < 4 \\ 0, & \text{otherwise} \end{cases}$$

are independent and orthogonal.

5.20 Let $\mathbf{X}_i$, i = 1,...,n, be n mutually orthogonal random vectors. Show that

$$E\left[\left\|\sum_{i=1}^{n}\mathbf{X}_i\right\|^2\right] = \sum_{i=1}^{n} E\left[\|\mathbf{X}_i\|^2\right].$$

(Hint: Use the definition $\|\mathbf{X}\|^2 \triangleq \mathbf{X}^T\mathbf{X}$.)

5.21 Let $\mathbf{X}_i$, i = 1,...,n, be n mutually uncorrelated random vectors with means $\boldsymbol{\mu}_i \triangleq E[\mathbf{X}_i]$.
Show that

$$E\left[\left\|\sum_{i=1}^{n}(\mathbf{X}_i - \boldsymbol{\mu}_i)\right\|^2\right] = \sum_{i=1}^{n} E\left[\|\mathbf{X}_i - \boldsymbol{\mu}_i\|^2\right].$$

5.22 Let $\mathbf{X}_i$, i = 1,...,n, be n mutually uncorrelated random vectors with $E[\mathbf{X}_i] = \boldsymbol{\mu}_i$,
i = 1,...,n. Show that


where $\mathbf{K}_i \triangleq E[(\mathbf{X}_i - \boldsymbol{\mu}_i)(\mathbf{X}_i - \boldsymbol{\mu}_i)^T]$.
5.23 Explain why none of the following matrices can be covariance matrices associated 
with real random vectors. 


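$$\text{(a)}\ \begin{bmatrix} 2 & -4 & 0 \\ -4 & 3 & 1 \\ 0 & b & 2 \end{bmatrix} \qquad \text{(b)}\ \begin{bmatrix} 4 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & -2 \end{bmatrix} \qquad \text{(c)}\ \begin{bmatrix} 6 & 1+j & 2 \\ 1-j & 5 & -1 \\ 2 & -1 & 6 \end{bmatrix} \qquad \text{(d)}\ \begin{bmatrix} 4 & 6 & 2 \\ 6 & 9 & 8 \\ 9 & 12 & 16 \end{bmatrix}$$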
5.24 (a) Let a vector X have E[X] = 0 with covariance Kxx given by 
=| 3. 42 
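Find a linear transformation $\mathbf{C}$ such that $\mathbf{Y} = \mathbf{C}\mathbf{X}$ will have

$$\mathbf{K}_{YY} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Is $\mathbf{C}$ a unitary transformation?
(b) Consider the two real symmetric matrices $\mathbf{A}$ and $\mathbf{A}'$ given by

$$\mathbf{A} \triangleq \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \qquad \mathbf{A}' \triangleq \begin{bmatrix} a' & b' \\ b' & c' \end{bmatrix}.$$

Show that when $a = c$ and $a' = c'$, the product $\mathbf{A}\mathbf{A}'$ is real symmetric. More
generally, show that if $\mathbf{A}$ and $\mathbf{A}'$ are any real symmetric matrices, then $\mathbf{A}\mathbf{A}'$
will be symmetric if $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A}$.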


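5.25 (K. Fukunaga [5-8, p. 33].) Let $\mathbf{K}_1$ and $\mathbf{K}_2$ be positive definite covariance matrices
and form

$$\mathbf{K} = a_1\mathbf{K}_1 + a_2\mathbf{K}_2, \quad \text{where } a_1, a_2 > 0.$$

Let $\mathbf{A}$ be a transformation that achieves

$$\mathbf{A}^T\mathbf{K}\mathbf{A} = \mathbf{I}, \qquad \mathbf{A}^T\mathbf{K}_1\mathbf{A} = \boldsymbol{\Lambda}^{(1)} = \mathrm{diag}(\lambda_1^{(1)},\ldots,\lambda_n^{(1)}).$$

(a) Show that $\mathbf{A}$ satisfies

$$\mathbf{K}^{-1}\mathbf{K}_1\mathbf{A} = \mathbf{A}\boldsymbol{\Lambda}^{(1)}.$$

(b) Show that $\mathbf{A}^T\mathbf{K}_2\mathbf{A} \triangleq \boldsymbol{\Lambda}^{(2)}$ is also diagonal, that is, $\boldsymbol{\Lambda}^{(2)} \triangleq \mathrm{diag}(\lambda_1^{(2)},\ldots,\lambda_n^{(2)})$.

(c) Show that $\mathbf{A}^T\mathbf{K}_1\mathbf{A}$ and $\mathbf{A}^T\mathbf{K}_2\mathbf{A}$ share the same eigenvectors.

(d) Show that the eigenvalues of $\boldsymbol{\Lambda}^{(2)}$ are related to the eigenvalues of $\boldsymbol{\Lambda}^{(1)}$ as

$$\lambda_i^{(2)} = \frac{1}{a_2}\left[1 - a_1\lambda_i^{(1)}\right]$$

and therefore are in inverse order from those of $\boldsymbol{\Lambda}^{(1)}$.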


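5.26 (J. A. McLaughlin [5-9].) Consider the m vectors $\mathbf{X}_i = (X_{i1},\ldots,X_{in})^T$, i = 1,...,m,
where n > m. Consider the n × n matrix $\mathbf{S} = \frac{1}{m}\sum_{i=1}^{m}\mathbf{X}_i\mathbf{X}_i^T$.

(a) Show that with $\mathbf{W} \triangleq (\mathbf{X}_1 \ldots \mathbf{X}_m)$, $\mathbf{S}$ can be written as

$$\mathbf{S} = \frac{1}{m}\mathbf{W}\mathbf{W}^T.$$

(b) What is the maximum rank of $\mathbf{S}$?

(c) Let $\mathbf{S}' \triangleq \frac{1}{m}\mathbf{W}^T\mathbf{W}$. What is the size of $\mathbf{S}'$? Show that the first m nonzero
eigenvalues of $\mathbf{S}$ can be computed from

$$\mathbf{S}'\boldsymbol{\Phi} = \boldsymbol{\Phi}\boldsymbol{\Lambda},$$

where $\boldsymbol{\Phi}$ is the eigenvector matrix of $\mathbf{S}'$ and $\boldsymbol{\Lambda}$ is the matrix of eigenvalues.
What are the relations between the eigenvectors and eigenvalues of $\mathbf{S}$ and $\mathbf{S}'$?
(d) What is the advantage of computing the eigenvectors from $\mathbf{S}'$ rather than $\mathbf{S}$?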


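5.27 (a) Let $\mathbf{K}$ be an n × n covariance matrix and let $\Delta\mathbf{K}$ be a real symmetric
perturbation matrix. Let $\lambda_i$, i = 1,...,n, be the eigenvalues of $\mathbf{K}$ and $\boldsymbol{\phi}_i$
the associated eigenvectors. Show that the first-order approximation to the
eigenvalues $\lambda_i'$ of $\mathbf{K} + \Delta\mathbf{K}$ yields

$$\lambda_i' = \boldsymbol{\phi}_i^T(\mathbf{K} + \Delta\mathbf{K})\boldsymbol{\phi}_i, \quad i = 1,\ldots,n.$$

(b) Show that the first-order approximation to the eigenvectors is given by


where $b_{ij} = \boldsymbol{\phi}_i^T\Delta\mathbf{K}\boldsymbol{\phi}_j/(\lambda_i - \lambda_j)$, $i \ne j$, and $b_{ii} = 0$.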

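5.28 Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ be the eigenvalues of a real symmetric matrix $\mathbf{M}$. For
$i \ge 2$, let $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_{i-1}$ be mutually orthogonal unit eigenvectors belonging to
$\lambda_1,\ldots,\lambda_{i-1}$. Prove that the maximum value of $\mathbf{u}^T\mathbf{M}\mathbf{u}$ subject to $\|\mathbf{u}\| = 1$ and
$\mathbf{u}^T\boldsymbol{\phi}_1 = \ldots = \mathbf{u}^T\boldsymbol{\phi}_{i-1} = 0$ is $\lambda_i$, that is, $\lambda_i = \max(\mathbf{u}^T\mathbf{M}\mathbf{u})$.

5.29 Let $\mathbf{X} = (X_1, X_2, X_3)^T$ be a random vector with $\boldsymbol{\mu} \triangleq E[\mathbf{X}]$ given by

$$\boldsymbol{\mu} = (5, -5, 6)^T$$

and covariance given by

$$\mathbf{K} = \begin{bmatrix} 5 & 2 & -1 \\ 2 & 5 & 0 \\ -1 & 0 & 4 \end{bmatrix}.$$

Calculate the mean and variance of

$$Y = \mathbf{A}^T\mathbf{X} + B,$$

where $\mathbf{A} = (2, -1, 2)^T$ and B = 5.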
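5.30 Two jointly Normal RVs $X_1$ and $X_2$ have joint pdf given by

$$f_{X_1X_2}(x_1, x_2) = \frac{2}{\pi\sqrt{7}}\exp\left[-\frac{8}{7}\left(x_1^2 + \frac{3}{2}x_1x_2 + x_2^2\right)\right].$$

Find a nontrivial transformation $\mathbf{A}$ in


such that $Y_1$ and $Y_2$ are independent. Compute the joint pdf of $Y_1, Y_2$.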
5.31 Show that if $\mathbf{X} = (X_1,\ldots,X_n)^T$ has mean $\boldsymbol{\mu} = (\mu_1,\ldots,\mu_n)^T$ and covariance

$$\mathbf{K} = \{K_{ij}\}_{n\times n},$$

then the scalar RV Y given by

$$Y \triangleq p_1X_1 + \cdots + p_nX_n$$

has mean

$$E[Y] = \sum_{i=1}^{n} p_i\mu_i$$

and variance

$$\sigma_Y^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} p_ip_jK_{ij}.$$

5.32 Compute the joint characteristic function of $\mathbf{X} = (X_1,\ldots,X_n)^T$, where the $X_i$, i =
1,...,n, are mutually independent and identically distributed Cauchy RVs, that is,

$$f_{X_i}(x) = \frac{\alpha}{\pi(x^2 + \alpha^2)}.$$

Use this result to compute the pdf of $Y = \sum_{i=1}^{n} X_i$.

5.33 Compute the joint characteristic function of $\mathbf{X} = (X_1,\ldots,X_n)^T$, where the $X_i$,
i = 1,...,n, are mutually independent and identically distributed binomial RVs.
Use this result to compute the PMF of $Y = \sum_{i=1}^{n} X_i$.

5.34 Let $\mathbf{X} = (X_1,\ldots,X_4)^T$ be a Gaussian random vector with $E[\mathbf{X}] = \mathbf{0}$. Show that

$$E[X_1X_2X_3X_4] = K_{12}K_{34} + K_{13}K_{24} + K_{14}K_{23},$$

where the $K_{ij}$ are elements of the covariance matrix $\mathbf{K} = \{K_{ij}\}_{4\times 4}$ of $\mathbf{X}$.

5.35 Let the joint pdf of $X_1, X_2, X_3$ be given by $f_{\mathbf{X}}(x_1, x_2, x_3) = \frac{2}{3}(x_1 + x_2 + x_3)$ over
the region $S = \{(x_1, x_2, x_3) : 0 < x_i \le 1, i = 1, 2, 3\}$ and zero elsewhere. Compute
the covariance matrix and show that the random variables $X_1, X_2, X_3$, although not
independent, are essentially uncorrelated.

5.36 Let $X_1, X_2$ be jointly Normal, zero-mean random variables with covariance matrix

$$\mathbf{K}_{XX} = \begin{bmatrix} 2 & -1.5 \\ -1.5 & 2 \end{bmatrix}.$$

Find a whitening transformation for $\mathbf{X} = (X_1, X_2)^T$. Write a MATLAB program to
show a scatter diagram, that is, $x_2$ versus $x_1$, where the latter are realizations of
$X_2, X_1$, respectively. Do this for the whitened variables as well. Choose between a
hundred and a thousand realizations.

5.37 (linear transformations) Let $Y_k = \sum_{j=1}^{n} a_{kj}X_j$, k = 1,...,n, where the $a_{kj}$ are real
constants, the matrix $\mathbf{A} = [a_{ij}]_{n\times n}$ is nonsingular, and the $\{X_i\}$ are random variables.
Let $\mathbf{B} = \mathbf{A}^{-1}$. Show that the pdf of $\mathbf{Y}$, $f_{\mathbf{Y}}(y_1,\ldots,y_n)$, is given by

$$f_{\mathbf{Y}}(y_1,\ldots,y_n) = |\det\mathbf{B}|\,f_{\mathbf{X}}(x_1^*,\ldots,x_n^*), \quad \text{where } x_i^* = \sum_{k=1}^{n} b_{ik}y_k \text{ for } i = 1,\ldots,n.$$

5.38 (auxiliary variables) Let $Y_1 \triangleq \sum_{i=1}^{n} X_i$ and $Y_2 \triangleq \sum_{i=2}^{n} X_i$. Compute the
joint pdf, $f_{Y_1Y_2}(y_1, y_2)$, by introducing the auxiliary variables $Y_k = \sum_{i=k}^{n} X_i$,
k = 3,...,n, and integrating over the range of each auxiliary RV. Show that
$f_{\mathbf{Y}}(y_1,\ldots,y_n) = f_{\mathbf{X}}(y_1 - y_2,\ldots,y_{n-1} - y_n, y_n)$. (This problem and the previous
are adapted from Example 4.9, p. 190, in Probability and Stochastic Processes for
Engineers, C. W. Helstrom, Macmillan, 1984).


Statistics: Part 1
Parameter Estimation


6.1 INTRODUCTION 


Statistics, which could equally be called applied probability, is a discipline that applies 
the principles of probability to actual data. Two key areas of statistics are parameter esti- 
mation and hypothesis testing. In parameter estimation, we use real-world data to esti- 
mate parameters such as the mean, standard deviation, variance, covariance, probabilities, 
and distributions. In hypothesis testing we use real-world data to make rational decisions, 
if possible, in a probabilistic environment. We leave the topic of hypothesis testing for 
Chapter 7. 

We recall that probability is a mathematical theory based on axioms and definitions
and its main results are theorems, corollaries, relationships, and models. While probability
enables us to model and solve a wide class of problems, the solutions to these problems
often assume knowledge that is not readily available in the real world. For example,
suppose we are given that $X\!:\!N(\mu, \sigma^2)$ and we wish to compute the probability of the event
$E = \{-1 \le X \le +1\}$. We do this easily and obtain $F_{SN}((1 - \mu)/\sigma) - F_{SN}((-1 - \mu)/\sigma)$.
However, in the real world how would we determine the parameters $\mu, \sigma$? For that matter,
how would we even determine that this is a Gaussian problem? In earlier chapters we used
important parameters such as $\mu_X$, the average or expected value of a random variable (RV)
X; $\sigma_X$, the standard deviation of X; $\sigma_X^2$, the variance of X; E[XY], the correlation of two
RVs X and Y; and others. We estimate these quantities in the real world using so-called
estimators, which are functions of RVs. What are the features of a good estimator? How
do we choose among different estimators for the same parameter? What strong statements
can we make regarding how “near” the estimate is to the true but unknown value?
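When the parameters are known, the probability of the event E in the paragraph above is a two-line computation. The sketch below is an illustration only; the function names are assumptions, and the standard Normal CDF is written with the error function from Python's standard library.

```python
import math


def standard_normal_cdf(x: float) -> float:
    """F_SN(x), the standard Normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P[a <= X <= b] for X Normal with mean mu and standard deviation sigma."""
    return standard_normal_cdf((b - mu) / sigma) - standard_normal_cdf((a - mu) / sigma)


# With mu = 0 and sigma = 1 this is the familiar "one sigma" probability.
print(prob_between(-1.0, 1.0, mu=0.0, sigma=1.0))  # about 0.6827
```

The difficulty raised in the text is that `mu` and `sigma` must be supplied; in practice they are replaced by estimates, which is the subject of this chapter.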


Much has been written about parameter estimation but the subject is not exhausted, as
witnessed by the large number of research articles in the archival literature devoted to the 
subject. There are several excellent books (e.g., [6-1, 6-2]) on statistics and parameter esti- 
mation with an “engineering” flavor, and a plethora of expository material on the internet. 


Example 6.1-1 
(is a coin fair?) Suppose you are involved in a game involving a coin and you would like 
to know if the coin is fair. You flip the coin and observe a head (H). What conclusion can 
you come to? Other than concluding that the coin is not restricted from coming up heads 
there isn’t much else that you can conclude. Now you repeat the experiment and observe a 
tail (T) on the next toss. Can you conclude that the coin is fair? That would be a highly 
risky conclusion. Suppose that in 10 tosses you observe the sequence {H, T, T, H, H, T, H, 
T, H, T}. Based on the observations and using the frequency interpretation of probability, 
we might conclude that $P[H] = n_H/n = 5/10 = 0.5$ and thus that the coin is fair but you
still cannot be certain. On the other hand, if you observe the sequence {H, T, H, H, H, 
H, T, H, H, H} you might be tempted to conclude that the coin is biased toward coming 
up heads but even here you can’t be certain. Is there a quantitative way of describing our 
uncertainty (or certainty)? In what follows we introduce some ideas that will help to answer 
this question. 


Independent, Identically Distributed (i.i.d.) Observations 


In the coin tossing experiment described above, upon tossing a coin we can define a generic
RV X as

$$X \triangleq \begin{cases} 1, & \text{if a head shows up}, \\ 0, & \text{if a tail shows up}. \end{cases}$$

If we toss the coin n times, we define a sequence of RVs $X_i$, i = 1,...,n, which are called
independent, identically distributed (i.i.d.) observations. The collection of these i.i.d. observations
$\{X_i; i = 1,\ldots,n\}$ is called a random sample of size n from X. In some situations, X
is more aptly called a population; but the set of observations on X is still called a random
sample of size n. The $X_i$ in this example happen to be Bernoulli RVs but in general they
could have any distribution as long as they all share the same CDF, pdf, or PMF and each
observation is unaffected by the outcomes of the previous observations.

We have already introduced the idea of i.i.d. RVs in connection with our discussion 
of the Central Limit Theorem but elaborate on them some more here because of their 
extraordinary importance in statistics. The observations are independent because, in this 
case, subsequent tosses are not influenced in any way by the outcomes of previous tosses or 
future tosses. More precisely, in terms of the joint probability mass function (PMF) of
$X_1, X_2, \ldots, X_n$:

$$P_{X_1X_2\cdots X_n}(x_1, x_2, \ldots, x_n) = P_{X_1}(x_1)P_{X_2}(x_2)\cdots P_{X_n}(x_n).$$

They are identically distributed because we are using the same coin in all the tosses and the
coin is assumed unaffected by the experiment. More precisely:

$$P_{X_1}(x) = P_{X_2}(x) = \cdots = P_{X_n}(x) \triangleq P_X(x), \quad -\infty < x < \infty.$$

When we deal with continuous random variables the property i.i.d. implies:

$$f_{X_1X_2\cdots X_n}(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)f_{X_2}(x_2)\cdots f_{X_n}(x_n)$$

$$f_{X_1}(x) = f_{X_2}(x) = \cdots = f_{X_n}(x) \triangleq f_X(x), \quad -\infty < x < \infty.$$


The idea of i.i.d. observations is counterintuitive for many readers. For example, a coin— 
judged fair by all physical and previous statistical tests—is tossed and comes up heads nine 
times in a row; surely some readers will expect the coin to come up tails on the tenth toss 
to “balance things out.” But the coin has no memory of its past history and on the tenth 
toss it is as likely to come up heads as tails.†


Example 6.1-2 
(failure of identically distributed condition) We study the arrival rates of customers at a 
barbershop. To that end we partition the workday (7 am to 3 pm) into 16 half-hour intervals 
and count the number of arrivals in each interval. Let X;,i = 1,...,16, denote the number 
of arriving customers in the ith interval. Here the X; are not i.i.d. (failure of the “identically 
distributed” requirement). We expect more arrivals in the early morning, before people must 
report to their jobs, than at other times in the day except possibly during the lunch break. 


Example 6.1-3 
(biased random sampling) A breakfast food company that produces BranPellets™ cereal 
intends to show that eating BranPellets™ will result in weight loss. To that end the company
hires a pollster to poll those who have attempted to lose weight by eating BranPellets™. 
The pollster begins by randomly selecting from the pool of BranPellets™ eaters but when 
the results do not seem to confirm that eating BranPellets™ results in weight loss, the 
pollster confines the polling to the sub-group of people of average or less-than average 
weight. With X; denoting the weight loss of the ith person polled after three months of 
eating BranPellets™, we note that the set of {X;} obtained by fair polling are unlikely 
to be distributed by the same law as the set of {X;} obtained by biased polling. Inciden- 
tally, we could formulate this as a hypothesis testing problem by formulating the hypoth- 
esis that eating BranPellets™ will result in weight loss versus the alternative that eating 
BranPellets™ will not result in weight loss. 


Example 6.1-4 
(non-independent sequences) A conservative gambler plays n rounds of blackjack. He starts
with a stash of 100 dollars and bets only 1 dollar at each round. Let $X_i$ denote the value of his stash at
the ith play. Are the $X_i$, i = 1,...,n, an independent sequence? Clearly $X_{i+1} = X_i \pm 1$; hence
the $X_i$ are not mutually independent; for example, $P[X_i = 10, X_{i+1} = 12] = 0$, although
taken separately neither probability needs to be zero. Let $Y_i$ denote the gambler’s win (or
loss) on the ith play. Then $Y_i = \pm 1$. Are the $Y_i$, i = 1,...,n, an independent sequence? The
answer is yes‡ because the outcome of the ith play has no memory of the past or future and
therefore cannot be affected by it.


†However, if in a large number of tosses there are many more heads than tails, the assumption that
the coin is fair needs to be re-examined. Here hypothesis testing (Chapter 7) is useful in making a stronger
statement than the coin is “probably fair” or “probably unfair.”

‡Several assumptions are at play here, among them that the dealer plays fairly and that the gambler
doesn’t change strategy as a result of his wins or losses.

Example 6.1-5
(review of joint versus sum probabilities) We make three i.i.d. observations on a zero/one
Bernoulli RV X and call these $X_1, X_2, X_3$†. The PMF of X is $P_X(x) = p^xq^{1-x}$, $p + q = 1$,
$x = 0, 1$. The joint PMF of the observations is

$$P_{X_1X_2X_3}(x_1, x_2, x_3) = p^{x_1+x_2+x_3}q^{3-(x_1+x_2+x_3)}.$$

Note that this is different from the PMF of the sum $Y \triangleq X_1 + X_2 + X_3$, which is binomial.
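The distinction between the joint PMF and the PMF of the sum can be made concrete by enumeration. In the sketch below, which is illustrative only (the function names and the value p = 0.3 are assumptions), each of the eight outcome triples gets its joint probability, and the triples with the same sum are then added up and compared with the binomial PMF.

```python
from itertools import product
from math import comb


def joint_pmf(xs, p):
    """Joint PMF of i.i.d. zero/one Bernoulli observations: p^s * q^(n-s), s = sum(xs)."""
    s, n = sum(xs), len(xs)
    return p ** s * (1.0 - p) ** (n - s)


def sum_pmf(k, n, p):
    """Binomial PMF of Y = X_1 + ... + X_n."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


p = 0.3  # illustrative value (assumption)
for k in range(4):
    outcomes = [xs for xs in product((0, 1), repeat=3) if sum(xs) == k]
    total = sum(joint_pmf(xs, p) for xs in outcomes)
    # columns: k, number of triples with sum k, summed joint PMF, binomial PMF
    print(k, len(outcomes), total, sum_pmf(k, 3, p))
```

The binomial coefficient is exactly the count of triples sharing a given sum, which is why the last two columns agree while any single triple has a smaller probability whenever more than one triple shares its sum.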


Estimation of Probabilities 


Suppose that, based on observations, we estimate that the probability§ of an event E is
$P[E] = n_E/n = 0.44$. Here n is the sample size and $n_E$ is the number of times the event E is
observed. How close is 0.44 to the “true” probability of the event? The “true” probability of
an event is often beyond our means to acquire. Suppose a medical researcher wants to know
the proportion (probability × 100) that his patient’s red blood cells are undersized. The true
proportion could, hypothetically, be obtained by counting all the undersized cells among all
red blood cells in the patient’s body and forming the ratio of the former to the latter. Of
course this isn’t done. Nevertheless an excellent estimate can be obtained by counting the
cells in a few drops of blood. As another example, suppose one of the states in the United
States has a county with 343,065 registered voters and 144,087 have voted Republican. Then
the true probability that a person in this county, picked at random, has voted Republican
is 0.42. However, the cost of polling 343,065 voters may be prohibitive (or impossible in the
time allowed) and pollsters may have to make predictions with much smaller random samples.
Thus, suppose that pollsters do a random sampling of 512 voters and find that 225 voters
have voted Republican. Then the estimated probability of Republican voters is 0.44. Notice
that if the sample size is small enough, the estimate of Republican voters can be almost
any number between zero and one. For example, if we poll only two voters and they both
voted Republican, our estimate of the probability of Republican voters would be one! But
this estimate would be completely unreliable! On the other hand, if we could say something
like “with a near-certain probability of 0.98 the estimated probability of a Republican voter
is between 0.42 and 0.46” then we would have made a “hard” statement about the
percentage of Republican voters. The probability 0.98 is a hard number because we can be
nearly certain that the percentage of Republican voters is between 42 and 46 percent. Thus,
the estimated probability of Republican voters is a “soft” number in the sense that it is,
typically, quite uncertain and becomes more so as the sample size decreases. In real life we
would much prefer to make categorical statements about the reliability of estimates than
offer estimates of uncertain reliability.


†Note that these $X_i$ are discrete random variables.
§We mentioned in Chapter 1 that in many if not most practical problems, probabilities have to be
estimated.

One of the central goals of parameter estimation is to construct events that are (nearly)
certain to occur, that is, events whose probability is a “hard” number. That is not to say
that soft numbers such as the estimated probability $\hat{p}$ are necessarily unreliable or useless.
For example, suppose that a rental-car dealership handling thousands of cars finds that at
the end of year 1, $n_R$ of its $n_1$ cars must be replaced due to wear and tear. Then, all
things being equal, if the agency starts year 2 with $n_2$ cars, it could reasonably expect that
approximately $\hat{p}n_2$ cars will have to be replaced at the end of year 2, perhaps a few more,
perhaps a few fewer. Here $\hat{p} \triangleq n_R/n_1$ is the estimated probability that a car will have to be
replaced by the end of year 2. From the point of view of the executives of the company, the
estimate $\hat{p}n_2$ is useful for year 2 planning and budgeting.

In Example 6.1-6 below we demonstrate how firm or hard conclusions can be drawn by 
applying basic principles of statistics. 


Example 6.1-6 
(estimating the number of fish in a lake) To illustrate how statistics can be used to generate
meaningful certain events, consider the following problem. The United States Fish and
Wildlife Service (FWS), a bureau of the Department of the Interior, is interested in estimating
the percentage of freshwater bass in a large lake that for specificity we call Bass
Lake. To that end, an “experiment” is performed where a net is used to capture a random
sample of fish, which is subsequently examined for its bass content. In preparation for this
experiment, we will denote the number of bass in the sample by $n_B$ and the fixed sample
size by n. Then we form the estimator $\hat{p} = n_B/n$, which is a random variable because $n_B$
is a random variable†. We do not consider n a random variable because we can decide a
priori how big a sample will be examined for its bass content. The true probability p that
a fish pulled at random from the lake is a bass is the ratio of total bass in the lake to total
fish in the lake; this number is unknown (and mildly variable over time since fish have a
tendency to eat each other). At the risk of adding additional notation, we must carefully
distinguish between the random variable $n_B$ (a function) and its realization, which is a
number. Realizations, whenever they don’t add to confusion, will be superscripted with a
prime. For example, a realization of $n_B$ might yield $n_B' = 58$, n = 133 and the estimated‡
probability that a fish selected at random will be a bass is $\hat{p}' = 58/133 = 0.44$. The range
of the function $n_B$ is the set of integers in the interval [0, n]. Of course, the realization
$\hat{p}'$ is only a one-time estimate of the true probability p that a fish will be a bass and we
would like to make a stronger statement about the number of bass in Bass Lake. Suppose
we examine the fish in the sample one-by-one. Let

$$X_i \triangleq \begin{cases} 1, & \text{if the } i\text{th fish is a bass}, \\ 0, & \text{else}. \end{cases}$$

Then $X_i$ is a Bernoulli RV with PMF $P_{X_i}(x) = p^x(1 - p)^{1-x}$, for x = 0, 1 and zero
else. The random sample $\{X_i, i = 1,\ldots,n\}$ consists of n i.i.d. observations on a generic
random variable X, denoting whether a fish is a bass or not. We can think of X as a
population, that is, the fish population. The RV $Z \triangleq \sum_{i=1}^{n} X_i$ represents the total number
of bass in a sample of n fish and $Z/n = \hat{p}$ is the estimator for p. Since Z is the sum
of independent Bernoulli RVs it is a binomial RV with PMF $b(k; n, p)$ (Example 4.8-1).
Then Z has mean np and standard deviation $\sigma_Z = \sqrt{np(1-p)}$. We next create the
(almost) certain event $E = \{np - 3\sqrt{np(1-p)} \le Z \le np + 3\sqrt{np(1-p)}\}$ and since Z
is the sum of a large number of i.i.d. random variables we can use the Normal approximation
to compute P[E] as allowed by the Central Limit Theorem. Indeed this was done in
Example 4.8-3, where P[E] was computed to be 0.997. We can rewrite P[E], using $Z/n \triangleq \hat{p}$,
as $P[E] = P\left[(\hat{p} - p)^2 \le \frac{9}{n}p(1 - p)\right] = 0.997$. We suggest that the reader verify this result.
The argument is a quadratic in p and solving for the roots $p_1, p_2$ of $(p - \hat{p})^2 = (9/n)p(1 - p)$
will give the end points of the interval of integration about p that will yield an event
probability of 0.997. These points are

$$p_1, p_2 = \frac{2\hat{p} + (9/n)}{2[1 + (9/n)]} \mp \sqrt{\left(\frac{2\hat{p} + (9/n)}{2[1 + (9/n)]}\right)^2 - \frac{\hat{p}^2}{1 + (9/n)}}.$$

For the numbers n = 133, $n_B' = 58$, we get $\hat{p}' = 0.44$ and find that $p_1' \approx 0.31$, $p_2' \approx 0.57$.
How do we interpret these results? First note that there is no probability associated with
the realized interval [0.31, 0.57]; it either contains the true probability p or not; its length
is 0.26 and $|\hat{p}' - p_1'| = |\hat{p}' - p_2'| \approx 0.13$. The number $100 \times |\hat{p}' - p_1'|$ is sometimes called the
margin of error, which in this case is 13 percent†. The interval with end points $[p_1, p_2]$ is a
random interval because its end points are random variables; that is, they depend on the
estimator $\hat{p}$. However, on the average, the interval will enclose the point p 997 times in a
thousand trials. We note that while the percentage of the bass in the lake is nowhere near
zero or 100 percent, the probability that bass make up between 31 and 57 percent of the
fish in Bass Lake is a near-certain event!

†Here and a few other places we briefly depart from our use of capital letters to denote random variables.
‡The realization of an estimator is sometimes called an estimate, that is, a number.
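The interval construction in this example, solving $(p - \hat{p})^2 = (9/n)p(1 - p)$ for its two roots, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `score_interval` and `coverage`, the seed, the number of trials, and the "true" value p = 0.44 used in the simulation are hypothetical choices, not part of the original text. With a general multiplier z in place of 3, the same construction is commonly known as the Wilson score interval.

```python
import math
import random


def score_interval(successes: int, n: int, z: float = 3.0):
    """Roots p1 <= p2 of (p - p_hat)^2 = (z^2 / n) * p * (1 - p)."""
    p_hat = successes / n
    c = z * z / n
    center = (2.0 * p_hat + c) / (2.0 * (1.0 + c))
    half = math.sqrt(center * center - p_hat * p_hat / (1.0 + c))
    return center - half, center + half


def coverage(p: float, n: int, trials: int = 2000, z: float = 3.0, seed: int = 0) -> float:
    """Fraction of simulated samples whose random interval encloses the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bass = sum(rng.random() < p for _ in range(n))  # one binomial draw
        lo, hi = score_interval(bass, n, z)
        hits += lo <= p <= hi
    return hits / trials


p1, p2 = score_interval(58, 133)   # the realization used in the example
print(round(p1, 3), round(p2, 3))  # end points of the realized interval
print(coverage(p=0.44, n=133))     # p = 0.44 is an assumed "true" value
```

The first line printed gives end points close to the rounded values quoted in the example. The second illustrates the "997 times in a thousand" interpretation: the randomness lies in the interval, not in p, so the simulated fraction should be near, though not exactly, 0.997.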

The above example illustrates how statistics has helped us to make a strong statement
about the number of bass in Bass Lake. The statement might read like this: Research has
shown that 44 percent of the fish in Bass Lake are bass. The margin of error is ±13 percent.


Example 6.1-7 
(estimating dengue fever probability) A newspaper article‡ reported that inhabitants and
visitors on the island of Key West in the State of Florida were being exposed to the virus 
that causes dengue fever. The illness is caused by the bite of a mosquito that carries the 
virus in its gut. While some in the island’s tourist industry minimized the likelihood that 
a visitor would be infected with the virus, an independent study found that among 240 
residents, presumably picked at random, 13 tested positive for the dengue fever virus. Some 
argued that the sample was too small to be accurate and that the dengue fever rate was 
much lower. Compute a 95 percent confidence interval on the true probability that a resident 
picked at random will test positive for dengue fever. 


†It is not uncommon to describe the margin of error with an algebraic sign, for example in this case ±13%.
‡The New York Times of July 23, 2010.




Solution Our estimator for the true mean is $\hat{p} = K/n$ where K is a Binomial random
variable, i.e.,

$$P_K[k \text{ successes in } n \text{ tries}] \triangleq b(k; n, p) = \binom{n}{k}p^k(1 - p)^{n-k},$$

with $E[\hat{p}] = p$ and $\mathrm{Var}[\hat{p}] = p(1 - p)/n$. From the data we compute the mean estimate as
$\hat{p}' = 13/240 = 0.054$. Since n >> 1, we use the Normal approximation to the Binomial and
define the standard Normal random variable

$$X \triangleq \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

such that $X\!:\!N(0, 1)$. Then a 95 percent confidence interval on p is found from solving
$P(-x_{0.975} < X < x_{0.975}) = 2F_{SN}(x_{0.975}) - 1 = 0.95$ or $x_{0.975} \approx 1.96$. Then

$$P\left[-1.96 < \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} < 1.96\right] = 0.95.$$

Using the technique in Example 6.1-6, we find that the lower and upper limits of the 95
percent interval on p, in this case, are the roots of the polynomial $1.016p^2 - 0.124p + 0.003 = 0$,
which are $p_1 = 0.033$, $p_2 = 0.089$. Thus we have 95 percent confidence that the infection
rate is from a low of 1 in 30 residents to a high of 1 in 11. Would this knowledge affect your
plans to visit Key West?
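The polynomial quoted in the solution can be reproduced from the data. The sketch below is an illustration, not the authors' computation: it obtains the 0.975 quantile from `statistics.NormalDist` in Python's standard library and forms the quadratic $a p^2 - b p + c = 0$ that results from squaring the inequality; the variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

n, positives = 240, 13           # data quoted in the example
p_hat = positives / n
z = NormalDist().inv_cdf(0.975)  # x_{0.975}, about 1.96

# (p_hat - p)^2 = (z^2 / n) p (1 - p)  ->  a p^2 - b p + c = 0
a = 1.0 + z * z / n
b = 2.0 * p_hat + z * z / n
c = p_hat * p_hat

disc = sqrt(b * b - 4.0 * a * c)
p1, p2 = (b - disc) / (2.0 * a), (b + disc) / (2.0 * a)
print(round(a, 3), round(b, 3), round(c, 3))  # 1.016 0.124 0.003
print(p1, p2)                                 # interval end points
```

Because this sketch keeps the coefficients unrounded, its end points can differ in the third decimal place from values computed with the three-digit polynomial.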


6.2 ESTIMATORS 


Estimators are functions of RVs that are used to estimate parameters but do not depend 
on the parameters themselves. We illustrate with some examples. 


Example 6.2-1 
(truth in packaging) A consumer protection agency (CPA) seeks to verify the information 
on the label of packages of cooked turkey breasts sold in supermarkets that says “70% meat, 
30% water.” The turkey breasts are produced by “Sundry Farms” and the CPA buys five 
“Sundry Farms” packages and checks for meat content percentage (mcp). With $X_i$ denoting
the mcp of the ith package, the CPA uses the function $\hat{\Theta}_1 = (1/n)\sum_{i=1}^{n} X_i$ to estimate the
average mcp. It finds the following mcp’s in the five packages (n = 5) respectively: 68, 82,
71, 65, 67 and obtains an average of 70.6 percent meat.

The 70.6 percent represents a realization of the estimator $\hat{\Theta}_1$ and is often called an
estimate. If the CPA buys another set of five packages of cooked turkey breasts from “Sundry 
Farms,” it would no doubt compute a slightly different estimate from the previous. 


Example 6.2-2 
(truth in packaging continued) The CPA seeks to estimate the variability in
the meat content of “Sundry Farms” turkey breasts. It uses the formula

$$\hat{\Theta}_2 = \left( \frac{1}{n} \sum_{i=1}^{n} \Big( X_i - \frac{1}{n} \sum_{j=1}^{n} X_j \Big)^2 \right)^{1/2}$$

with $n = 5$ and obtains approximately 6.0 percent meat variability using the data in the
previous example.

Example 6.2-3
(truth in packaging continued) The CPA is criticized for using $\hat{\Theta}_2$ in the previous example as
a measure of variability. It is suggested that the CPA use instead the estimator

$$\hat{\Theta}_3 = \left( \frac{1}{n-1} \sum_{i=1}^{n} \Big( X_i - \frac{1}{n} \sum_{j=1}^{n} X_j \Big)^2 \right)^{1/2}.$$

Using $\hat{\Theta}_3$ with $n = 5$, the CPA computes a meat variability of 6.7 percent.
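The three estimates quoted in Examples 6.2-1 through 6.2-3 can be reproduced with a few lines of code. This is only an illustrative sketch; the variable names `theta1`, `theta2`, and `theta3` are ours and stand for realizations of the mean estimator, the divide-by-$n$ estimator, and the divide-by-$(n-1)$ estimator, respectively.

```python
import math

mcp = [68, 82, 71, 65, 67]   # meat content percentages from Example 6.2-1
n = len(mcp)

theta1 = sum(mcp) / n                           # sample mean
sum_sq = sum((x - theta1) ** 2 for x in mcp)    # sum of squared deviations
theta2 = math.sqrt(sum_sq / n)                  # variability, dividing by n
theta3 = math.sqrt(sum_sq / (n - 1))            # variability, dividing by n - 1

print(f"{theta1:.1f} {theta2:.1f} {theta3:.1f}")   # prints: 70.6 6.0 6.7
```

Dividing by $n - 1$ rather than $n$ always gives the larger value, and the gap is most noticeable for small samples such as this one.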


In what follows we shall find that $\hat{\Theta}_1$ is an unbiased and consistent estimator for the mean,
$\hat{\Theta}_2$ is a biased, maximum-likelihood estimator for the standard deviation, and $\hat{\Theta}_3$ is an unbiased
and consistent estimator for the standard deviation. Other estimators are used to estimate
$\mathrm{Var}[X]$, the covariance matrix $\mathbf{K}$, and so on for the higher joint moments.

Some estimators have more desirable properties than others do. To evaluate estimators 
we introduce the following definitions. 


Definition 6.2-1 An estimator† $\hat{\Theta}$ is a function of the observation vector $\mathbf{X} =
(X_1, \ldots, X_n)^T$ that estimates $\theta$ but is not a function of $\theta$.


Definition 6.2-2 An estimator $\hat{\Theta}$ for $\theta$ is said to be unbiased if and only if $E[\hat{\Theta}] = \theta$.
The bias in estimating $\theta$ with $\hat{\Theta}$ is‡

$$|E[\hat{\Theta}] - \theta|.$$

Definition 6.2-3 An estimator $\hat{\Theta}$ is said to be a linear estimator of $\theta$ if it is a linear
function of the observation vector $\mathbf{X} \triangleq (X_1, \ldots, X_n)^T$, that is,

$$\hat{\Theta} = \mathbf{b}^T \mathbf{X}. \tag{6.2-1}$$


The vector b is an n x 1 vector of coefficients that do not depend on X. 


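Definition 6.2-4 Let $\hat{\Theta}_n$ be an estimator computed from $n$ samples $X_1, \ldots, X_n$ for
every $n \geq 1$. Then $\hat{\Theta}_n$ is said to be consistent if

$$\lim_{n \to \infty} P[|\hat{\Theta}_n - \theta| > \varepsilon] = 0 \quad \text{for every } \varepsilon > 0. \tag{6.2-2}$$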


The condition in Equation 6.2-2 is often referred to as convergence in probability. 


Definition 6.2-5 An estimator $\hat{\Theta}$ is called minimum-variance unbiased if

$$E[(\hat{\Theta} - \theta)^2] \leq E[(\hat{\Theta}' - \theta)^2], \tag{6.2-3}$$

where $\hat{\Theta}'$ is any other estimator and $E[\hat{\Theta}'] = E[\hat{\Theta}] = \theta$.


Definition 6.2-6 An estimator $\hat{\Theta}$ is called a minimum mean-square error (MMSE)
estimator if

$$E[(\hat{\Theta} - \theta)^2] \leq E[(\hat{\Theta}' - \theta)^2], \tag{6.2-4}$$

where $\hat{\Theta}'$ is any other estimator.

†The validity of estimating parameters as well as other objects, for instance probabilities, from repeated
observations is based, fundamentally, on the law of large numbers and the Chebyshev inequality.
‡The bias is often defined without the magnitude sign. In that case the bias could be positive or negative.

There are several other properties of estimators that are deemed desirable such as efficiency,
completeness, and invariance. These properties are discussed in books on statistics†
and will not be discussed further here.


6.3 ESTIMATION OF THE MEAN 


In Chapter 4 we showed that the numerical average, $\mu_s$, of a set of numbers is the number
that is simultaneously closest to all the numbers $x_1, x_2, \ldots, x_n$ in a set. In this sense $\mu_s$ can
be regarded as the best representative of the set. Borrowing from mechanics, some think of
the average as the center of gravity of the set. While the sample average doesn’t tell the
whole story, it is a useful descriptor for assessment in all sorts of situations. For example, if
the average grade on a standardized test earned by students in School A is 92 and the average
grade on the same test is 71 for students at School B, then, all other things being equal,
one might conclude that School A does a better job of preparing its students than School
B. If a large amount of data, suitably corrected for other factors (e.g., sex, income, race,
lifestyle), showed that the average lifetime of smokers is 67 years while that of nonsmokers
is 78 years, one could reasonably conclude that smoking is bad for your health.
Repeating Equation 4.1-1 here with a slight change of notation,

$$\mu_s(n) = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{6.3-1}$$


we observe that the numerical average depends on the size $n$ of the sample
as well as the samples themselves. In our model we assume that the data are realizations of
$n$ i.i.d. observations on the generic random variable $X$; that is, $x_1$ is a one-time realization
of the observation $X_1$, $x_2$ is a one-time realization of the observation $X_2$, and so forth. Each
of the $X_i$ is a function while $x_i$ is a numerical value that the function takes. We create
the mean-estimator function

$$\hat{\mu}_X(n) = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{6.3-2}$$

from the random sample $\{X_1, \ldots, X_n\}$ to estimate the unknown parameter $\mu_X \triangleq E[X]$.
We recognize that $\hat{\mu}_X(n)$ is the estimator $\hat{\Theta}_1$ introduced in Section 6.2. The object in
Equation 6.3-2 is often called the sample mean. We use the hat to indicate that $\hat{\mu}_X(n)$ is
an estimator and not the actual mean. Incidentally, it is useful to introduce at this point
the variance-estimator function (VEF) or the sample variance as

$$\hat{\sigma}_X^2(n) \triangleq \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu}_X(n))^2. \tag{6.3-3}$$

We recognize that this VEF is the square of the estimator $\hat{\Theta}_3$ in Section 6.2. This is one of two
VEFs that are in common use. The other one is

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu}_X(n))^2. \tag{6.3-4}$$

†See, for example, [6-1, Chapter 8].


We shall discuss the estimation of the variance in a later section but for now, we ask the
reader’s indulgence to take Equation 6.3-3 at face value. The estimation of $\sigma_X^2$ by the VEF
in Equation 6.3-3 is, as we shall see later, an entirely reasonable thing to do. Among other
attractive features we find that $E[\hat{\sigma}_X^2(n)] = \sigma_X^2$, which is only asymptotically true for the VEF
of Equation 6.3-4.


Properties of the Mean-Estimator Function (MEF) 


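The mean estimator given by Equation 6.3-2 is unbiased, meaning that $E[\hat{\mu}_X(n) - \mu_X] = 0$.
The proof of this important result is easy. We write

$$E[\hat{\mu}_X(n)] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n \mu_X = \mu_X. \tag{6.3-5}$$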


An unbiased estimator is often, but not always, desired†. Another important property
of an estimator is that, in some sense, it gets better as we make more observations. For
example, we would expect the MEF in Equation 6.3-2 to be more “reliable” if it is based on
100 rather than on 10 observations. One way to measure reliability is by way of the variance
of the unbiased estimator. If the variance of the unbiased estimator is small, it is unlikely
that a realization of $\hat{\mu}_X(n)$ will be very far from the true mean $\mu_X$; if the variance is large,
the realization might often be far from the true mean. Consider the variance of $\hat{\mu}_X(n)$. By
definition this is

$$\begin{aligned}
\mathrm{Var}[\hat{\mu}_X(n)] &= E\big[(\hat{\mu}_X(n) - \mu_X)^2\big] \\
&= E\Big[\Big(\frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n}\, n \mu_X\Big)^2\Big] \\
&= \frac{1}{n^2} E\Big[\sum_{i=1}^{n} (X_i - \mu_X)^2 + \sum_{i=1}^{n} \sum_{j \neq i} (X_i - \mu_X)(X_j - \mu_X)\Big] \\
&= \frac{1}{n^2} \sum_{i=1}^{n} E[(X_i - \mu_X)^2] + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} E[(X_i - \mu_X)(X_j - \mu_X)] \\
&= \sigma_X^2 / n.
\end{aligned} \tag{6.3-6}$$


In line 1, the term on the right uses the unbiasedness of the MEF. Line 2 uses the definition
of the MEF and multiplies and divides $\mu_X$ by $n$. Line 3 uses that the square of a sum is
the sum of squares plus the sum of cross-term products with nonequal indexes. Line 4 uses
the linearity of the expectation operator, and line 5 takes advantage of the fact that for
$i \neq j$, $X_i$ and $X_j$ are independent and therefore $E[(X_i - \mu_X)(X_j - \mu_X)] = 0$. At this point
we invoke the Chebyshev inequality (see Section 4.4) and apply it to $\hat{\mu}_X(n)$. Then for any
$\delta > 0$

$$P[|\hat{\mu}_X(n) - \mu_X| > \delta] \leq \frac{\mathrm{Var}[\hat{\mu}_X(n)]}{\delta^2} = \frac{\sigma_X^2}{n \delta^2} \xrightarrow{\,n \to \infty\,} 0. \tag{6.3-7}$$

Equations 6.3-6 and 6.3-7 are among the most important results in all of statistics. Equation
6.3-6 says that the variance of the mean estimator decreases with increasing $n$ and hence can
be made arbitrarily small by choosing a large enough sample size. Specifically, the variance
of the mean estimator is numerically equal to the variance of the observation variable
divided by the sample size. This is true so long as the observation variable $X$ has finite
variance. Equation 6.3-7 says that the event that the absolute deviation between the true
mean and the MEF exceeds a certain value, no matter how small that value is, becomes
highly improbable when the sample size is made large enough. An estimator that obeys
Equations 6.3-6 and 6.3-7 is said to be consistent.

†One may tolerate a small amount of bias if the estimator has other desirable properties.


Example 6.3-1 
(effect of sample size on estimating the mean) We wish to compute $P[|\hat{\mu}_X(n) - \mu_X| \leq 0.1]$
when $X$ is Normal with $\sigma_X = 3$. To illustrate the effect of sample size we use two random
samples: a small sample ($n = 64$) and a large sample ($n = 3600$). We write

$$\begin{aligned}
P[|\hat{\mu}_X(n) - \mu_X| \leq 0.1] &= P[-0.1\sqrt{n}/\sigma_X \leq Y \leq 0.1\sqrt{n}/\sigma_X] \\
&= 2\,\mathrm{erf}\!\left(\frac{0.1\sqrt{n}}{\sigma_X}\right) \\
&= 2\,\mathrm{erf}(0.0333\sqrt{n}),
\end{aligned}$$

where $Y \triangleq (\hat{\mu}_X(n) - \mu_X)/(\sigma_X/\sqrt{n})$ is distributed as $N(0,1)$. When $n = 64$, $P[|\hat{\mu}_X(n) - \mu_X| \leq
0.1] \approx 0.2$. We can interpret this result as saying that in a thousand trials involving sample
sizes of 64, in only about 200 outcomes will the mean estimate deviate from the true mean
by 0.1 or less. For $n = 3600$, we compute $P[|\hat{\mu}_X(n) - \mu_X| \leq 0.1] \approx 0.95$, which implies that
the event $\{|\hat{\mu}_X(n) - \mu_X| \leq 0.1\}$ will occur in about 950 out of 1000 trials. The implication
is that in a single trial, the event $\{|\hat{\mu}_X(n) - \mu_X| \leq 0.1\}$ will almost certainly happen when
$n = 3600$.
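The two probabilities in Example 6.3-1 can be checked with the Python standard library. This is a minimal sketch; the helper name `prob_within` is ours. Note that `math.erf` uses a different normalization from the erf of this text: in the library convention, $2F_{SN}(z) - 1$ equals `math.erf(z / sqrt(2))`.

```python
import math

def prob_within(eps, sigma, n):
    """P[|sample mean - true mean| <= eps] for n i.i.d. Normal observations."""
    z = eps * math.sqrt(n) / sigma
    # 2 * F_SN(z) - 1, written with the standard-library error function.
    return math.erf(z / math.sqrt(2.0))

p_small = prob_within(0.1, 3.0, 64)     # small sample
p_large = prob_within(0.1, 3.0, 3600)   # large sample
print(round(p_small, 2), round(p_large, 2))
```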


Example 6.3-2 
(how many samples do we need to get a 95 percent confidence interval on the mean?) We
want to compute a 95 percent confidence interval on the mean of a Normal random variable
$X$. How many observations $X_1, \ldots, X_n$ on $X$ do we need? More to the point, what parameters
determine the length and location of the interval? The terminology “95 percent
confidence interval” merely means that we seek the end points of the shortest (or near-shortest)
interval on the real line such that we expect that in 950 or so cases out of 1000
the interval will enclose the true mean. In terms of a probability we write

$$P[|\hat{\mu}_X(n) - \mu_X| \leq \gamma_{0.95}] = 0.95, \tag{6.3-8}$$

where $\gamma_{0.95}$ is a number to be determined and its subscript reminds us that it
is a 95 percent confidence interval we seek. We recall that $\hat{\mu}_X(n)$ is $N(\mu_X, \sigma_X^2/n)$ so that

$$Y \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \tag{6.3-9}$$

is $N(0,1)$. Then, rewriting Equation 6.3-8 with $Y$ in mind, we obtain

$$\begin{aligned}
0.95 &= P[-\gamma_{0.95} \leq \hat{\mu}_X(n) - \mu_X \leq \gamma_{0.95}] \\
&= P[-\gamma_{0.95}\sqrt{n}/\sigma_X \leq Y \leq \gamma_{0.95}\sqrt{n}/\sigma_X] \\
&= 2F_{SN}(\gamma_{0.95}\sqrt{n}/\sigma_X) - 1.
\end{aligned} \tag{6.3-10}$$


In line 2 we converted the RV $\hat{\mu}_X(n) - \mu_X$ into an $N(0,1)$ random variable $Y$. In line
3 we expressed this probability in terms of the standard Normal CDF. The last line of
Equation 6.3-10 yields the result we seek, that is, $F_{SN}(\gamma_{0.95}\sqrt{n}/\sigma_X) = 0.975$. As on other
occasions we use the symbol $F_{SN}(z_u) = u$ to denote the standard Normal (SN) CDF.
The number $z_u$ is called the $u$-percentile of the standard Normal. From the tables of the
CDF (see Appendix G) we find that $z_{0.975} = 1.96$. But since $z_{0.975} = \gamma_{0.95}\sqrt{n}/\sigma_X$, we
deduce that $\gamma_{0.95}\sqrt{n}/\sigma_X = 1.96$ or, equivalently, $\gamma_{0.95} = 1.96\sigma_X/\sqrt{n}$. Returning to the
problem at hand, we note that the event $\{|\hat{\mu}_X(n) - \mu_X| \leq \gamma_{0.95}\}$ is the same as the event
$\{\hat{\mu}_X(n) - \gamma_{0.95} \leq \mu_X \leq \hat{\mu}_X(n) + \gamma_{0.95}\}$. Then, from the middle line of Equation 6.3-10 we
get (on the average) a shortest 95 percent confidence interval for $\mu_X$ as

$$\left[-1.96\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n),\; 1.96\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n)\right]. \tag{6.3-11}$$

Of course this result can be generalized to other than 95 percent confidence intervals.
Suppose we seek a $\delta$-confidence interval (here we specified $\delta = 0.95$). Then a $\delta$-confidence
interval on $\mu_X$ is

$$\left[-z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n),\; z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n)\right]. \tag{6.3-12}$$


How do we know that, on the average, it is the shortest interval? Because of the symmetry 
of the Normal pdf, the largest amount of probability mass is at the center. Any other 95 
percent interval will require more support, that is, need a longer length. 

Let us return to what was asked for. The question as to how many samples are needed
for a shortest 95 percent confidence interval cannot be answered if $\sigma_X$ is not known.
Clearly, by choosing a large enough interval, for example, a ten-sigma width on either
side of $\hat{\mu}_X(n)$, we shall get a 95 percent (and more!) confidence even when the number of
samples, $n$, is small. But with a ten-sigma width on either side the interval will not be the
shortest and will prove useless because it is too large. So let us assume that it is the shortest
interval that we seek. Then the interval will be centered about $\hat{\mu}_X(n)$ and have width
$W_{0.95} = 2 \times 1.96\sigma_X/\sqrt{n}$. So clearly, the ratio $\sigma_X/\sqrt{n}$ determines the width of the interval.
If $\sigma_X$ is known (this is unlikely in practice), then we can determine how many samples
we need to obtain a confidence interval of a specified length. For an arbitrary $\delta$-confidence
interval, the width of the confidence interval is

$$W_\delta = 2 z_{(1+\delta)/2} \times \sigma_X/\sqrt{n}. \tag{6.3-13}$$


Not surprisingly we find from Equation 6.3-13 that the interval gets wider (which increases 
our uncertainty as to the true mean) when the standard deviation of X increases but gets 
smaller (which decreases our uncertainty as to the true mean) when the number of samples 
increases. Also the interval gets wider when the demanded percent confidence increases. 
Does this make sense? 


Procedure for Getting a $\delta$-Confidence Interval on the Mean of a Normal
Random Variable When $\sigma_X$ Is Known


(1) Choose a value of $\delta$ and compute $(1 + \delta)/2$;

(2) From the tables of the CDF for the standard Normal find the percentile $z_{(1+\delta)/2}$
such that $F_{SN}(z_{(1+\delta)/2}) = (1 + \delta)/2$;

(3) Obtain the realizations of $X_i$, $i = 1, \ldots, n$. Label these numbers $x_i$, $i = 1, \ldots, n$.
Compute the numerical average $\mu_s = \frac{1}{n}\sum_{i=1}^{n} x_i$;

(4) Compute the interval $\left[-z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \mu_s,\; z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \mu_s\right]$.
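The four steps above can be sketched in code as follows. This is an illustration under stated assumptions, not a prescribed implementation: the function names `z_interval` and `samples_needed` are ours, the table lookup of step (2) is replaced by `statistics.NormalDist().inv_cdf` from the Python standard library, and the data values are made up. The second function inverts Equation 6.3-13 to find the smallest $n$ that gives a desired width.

```python
import math
from statistics import NormalDist, fmean

def z_interval(samples, sigma, delta=0.95):
    """delta-confidence interval on the mean when sigma is known (steps 1 to 4)."""
    z = NormalDist().inv_cdf((1.0 + delta) / 2.0)   # percentile z_{(1+delta)/2}
    mu_s = fmean(samples)                           # numerical average
    half = z * sigma / math.sqrt(len(samples))
    return mu_s - half, mu_s + half

def samples_needed(sigma, width, delta=0.95):
    """Smallest n for which the interval width 2*z*sigma/sqrt(n) is <= width."""
    z = NormalDist().inv_cdf((1.0 + delta) / 2.0)
    return math.ceil((2.0 * z * sigma / width) ** 2)

# Hypothetical data: four observations with known sigma = 1.
lo, hi = z_interval([1.0, 2.0, 3.0, 4.0], sigma=1.0)
n_req = samples_needed(sigma=3.0, width=0.2)
print(round(lo, 2), round(hi, 2), n_req)
```

With $\sigma_X = 3$ and a target width of 0.2 (that is, $\pm 0.1$), the required sample size comes out close to the $n = 3600$ used in Example 6.3-1.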


Up until now, we have assumed that $\sigma_X$ is known. However, $\sigma_X$ is typically not known. (Can
you think of a situation where we do not know $\mu_X$ but know $\sigma_X$?) One possible solution to
this problem is to replace $\sigma_X$ in Equation 6.3-11 by an estimated value of it, for example,
$\hat{\sigma}_X(n)$, the square root of Equation 6.3-3, and continue with our assumption that $Y$ is
Normal. But in fact $Y$ would not be Normal because of the randomness in $\hat{\sigma}_X(n)$, and this
might not yield accurate results, especially when the sample size is not large. Not knowing
$\sigma_X$ requires that we seek another approach for determining a prescribed confidence interval.
Such an approach is furnished by the t-distribution discussed below.


Confidence Interval for the Mean of a Normal Distribution When $\sigma_X$
Is Not Known


In general, the distributions one encounters in statistics are often of an algebraic form that
is more complex than those we encounter in elementary probability. One of these is the
so-called “Student’s” t-distribution introduced by W. S. Gosset in connection with his
work of computing a confidence interval for the mean of a Normal distribution when the
variance is not known. Gosset is considered one of the founders of modern statistics but is
better known by his pen name Student†. As we saw in our previous discussion, the problem
of finding the end points of a confidence interval involves the distribution of the $N(0,1)$ RV

$$Y \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}}.$$

†1876-1937. Much secrecy enveloped his work on statistical quality control at the Dublin brewery of
Arthur Guinness & Son. For this reason he used the pseudonym “Student.”


However, without knowledge of $\sigma_X$ we cannot find the end points that define the confidence
interval. So we create a new RV by replacing $\sigma_X$ by

$$\hat{\sigma}_X(n) = \left(\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\mu}_X(n))^2\right)^{1/2},$$

which is merely the sigma value derived from the VEF of Equation 6.3-3. This new RV is
defined by

$$T_{n-1} \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\left(\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\mu}_X(n))^2\right)^{1/2} \big/ \sqrt{n}} = \frac{\hat{\mu}_X(n) - \mu_X}{\hat{\sigma}_X(n)/\sqrt{n}} \tag{6.3-14}$$

and is said to have a t-distribution with $n - 1$ degrees of freedom for $n = 2, 3, \ldots$ We do
not treat $T_{n-1}$ as an approximation to a standard Normal RV. As $n$ changes, we generate
a family of t-distributions. We denote the pdf associated with $T_{n-1}$ by $f_T(x; n-1)$. The
important thing to observe is that $T_{n-1}$ does not involve the unknown $\sigma_X$, a fact that
enables us to compute confidence intervals on the mean $\mu_X$, something we could not do
using the RV in Equation 6.3-9.

It is important for the reader to understand that in creating the t-distribution we did
not approximate $\sigma_X$ by $\hat{\sigma}_X$. The brilliance of the contribution of Gosset was in avoiding
approximations required to use the Normal distribution and working instead with $T_{n-1}$ and
its distribution.

For insight, we can rewrite Equation 6.3-14 as

$$T_{n-1} = \frac{Y}{\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \hat{\mu}_X(n)}{\sigma_X}\right)^2\right)^{1/2}} = \frac{Y}{(Z_{n-1}/(n-1))^{1/2}}, \tag{6.3-15}$$

where $Y \triangleq (\hat{\mu}_X(n) - \mu_X)\sqrt{n}/\sigma_X : N(0,1)$ and $Z_{n-1} \triangleq \sum_{i=1}^{n}\left(\frac{X_i - \hat{\mu}_X(n)}{\sigma_X}\right)^2$ has a $\chi^2_{n-1}$ pdf
with $n - 1$ degrees of freedom. When spelled out the symbol $\chi^2$ is written Chi-square
(pronounced ky-square as in sky-square). The subscript of the Chi-square RV gives the
number of degrees of freedom (DOF) and the RV range is $(0, \infty)$. This implies that the
CDF $F_{\chi^2}(z; n) = 0$ for $z \leq 0$ for every integer $n \geq 1$. The $\chi^2$ distribution was introduced
in Chapter 2 and is sometimes called a sampling distribution because it involves
i.i.d. samples of a population $X$. It is not obvious but $Y$ and $Z_{n-1}$, although sharing the
same $X_i$, $i = 1, \ldots, n$, can be shown to be statistically independent (see Appendix G). From
Equation 6.3-15 we see that the t-random variable is the ratio of a standard Normal
RV (numerator) to the square root of a quotient of a Chi-square RV divided by the DOF.

[Figure 6.3-1: t-pdf versus Normal pdf.]

Figure 6.3-1 The probability density function of the T random variable has a shape similar to that of
the Normal pdf, especially as the number of degrees of freedom gets larger. Here is shown the t-pdf for
n = 3 (peaks at 0.36); n = 13 (the curve with the boxes that peaks at 0.39); and the SN pdf. Except for
a barely observed variation in the tails, the n = 13 t-distribution is virtually identical with the Normal.

For large values of $n$ the t-distribution will not be that different from the Normal (see
Figure 6.3-1). Indeed the pdf of $T_{n-1}$ is centered at the origin and symmetrical about it. In
seeking the shortest confidence interval for $\mu_X$, we consider the event $\{-t_{(1+\delta)/2} \leq T_{n-1} \leq t_{(1+\delta)/2}\}$.
The probability of this event is

$$P[-t_{(1+\delta)/2} \leq T_{n-1} \leq t_{(1+\delta)/2}] = \delta, \tag{6.3-16}$$


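where, as before, $100 \times \delta$ is the assigned percent confidence for the interval on $\mu_X$. With the
CDF for the $T_{n-1}$ RV denoted by $F_T(t; n-1) \triangleq \int_{-\infty}^{t} f_T(x; n-1)\,dx$, we find that

$$\delta = 2F_T(t_{(1+\delta)/2}; n-1) - 1$$

or, equivalently,

$$F_T(t_{(1+\delta)/2}; n-1) = \frac{1+\delta}{2}. \tag{6.3-17}$$

From the tables of the cumulative t-distribution with DOF $n - 1$ in Appendix G, we can
determine the t-percentile $t_{(1+\delta)/2}$. Finally, from Equations 6.3-14 and 6.3-16, we obtain

$$P\left[\hat{\mu}_X(n) - \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}} \leq \mu_X \leq \hat{\mu}_X(n) + \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}\right] = \delta,$$

which gives as a $100\delta$ percent confidence interval

$$\left[\hat{\mu}_X(n) - \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}},\; \hat{\mu}_X(n) + \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}\right]. \tag{6.3-18}$$

The width of the confidence interval is

$$W_\delta = \frac{2\,t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}. \tag{6.3-19}$$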
Procedure for Getting a $\delta$-Confidence Interval Based on n Observations
on the Mean of a Normal Random Variable When $\sigma_X$ Is Not Known


(1) Choose a value of $\delta$ and compute $(1 + \delta)/2$;

(2) From the tables of the CDF for $T_{n-1}$, find the t-percentile number $t_{(1+\delta)/2}$ such that
$F_T(t_{(1+\delta)/2}; n-1) = (1 + \delta)/2$;

(3) Obtain the realizations of $X_i$, $i = 1, \ldots, n$. Label these numbers $x_i$, $i = 1, \ldots, n$.
Compute the realizations of $\hat{\mu}_X(n)$, $\hat{\sigma}_X(n)$;

(4) Compute the numerical realization of the interval
$\left[\hat{\mu}_X(n) - \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}},\; \hat{\mu}_X(n) + \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}\right]$.

Example 6.3-3
(confidence interval on $\mu_X$ when $\sigma_X$ is unknown, Normal case) Twenty-one i.i.d. observations
($n = 21$) are made on a Gaussian RV $X$. These observations are denoted as
$X_1, X_2, \ldots, X_{21}$. Based on the data, the realizations of $\hat{\mu}_X(n)$ and $\hat{\sigma}_X(n)/\sqrt{n}$ are, respectively,
3.5 and 0.45. A 90 percent confidence interval on $\mu_X$ is desired.


Solution Since $P[-t_{0.95} \leq T_{20} \leq t_{0.95}] = 0.9$, we obtain from Equation 6.3-17 $F_T(t_{0.95}; 20)
= 0.5(1 + 0.9) = 0.95$. Entering the Student-t tables at $F = 0.95$ and 20 degrees of freedom we obtain
$t_{0.95} = 1.725$. The corresponding interval, from Equation 6.3-18, is $[3.5 - 1.725 \times 0.45,\; 3.5 +
1.725 \times 0.45] = [2.72, 4.28]$. The width of the interval is $W_\delta \approx 2 \times 1.725 \times 0.45 = 1.55$.
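The table lookup in Example 6.3-3 can be reproduced numerically. The sketch below is one possible approach and not the method of the text: it assumes the standard closed form of the Student-t pdf, integrates it with Simpson's rule, and solves $F_T(t; n-1) = (1+\delta)/2$ by bisection. The function names `t_pdf`, `t_cdf`, and `t_percentile` are ours.

```python
import math

def t_pdf(x, dof):
    """Student-t pdf with dof degrees of freedom (standard closed form)."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1.0 + x * x / dof) ** (-(dof + 1) / 2)

def t_cdf(t, dof, steps=2000):
    """CDF for t >= 0: 0.5 plus Simpson's rule on [0, t] (the pdf is symmetric)."""
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 0.5 + total * h / 3.0

def t_percentile(u, dof):
    """Solve F_T(t; dof) = u for 0.5 < u < 1 by bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, dof) < u:    # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, dof) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example 6.3-3: n = 21, realizations 3.5 (mean) and 0.45 (sigma-hat / sqrt(n)).
t95 = t_percentile(0.95, 20)
low, high = 3.5 - t95 * 0.45, 3.5 + t95 * 0.45
print(round(t95, 3), round(low, 2), round(high, 2))
```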


Interpretation of the Confidence Interval 


The confidence interval generated from a series of realizations either will or will not include 
the true mean of X, which is a number unknown to us. Therefore, what does it mean to 
say that we have a “90 percent” confidence interval? The answer to this question goes to 
the heart of the meaning of probability, namely the frequency of a desirable outcome in 
repeated trials. Put succinctly, a “90 percent” confidence interval means that, say, in a 
thousand trials, one will observe that the interval covers the true mean about 900 times. 
Will we observe exactly 900 true-mean coverages? Not likely, but 900 coverages is the
most likely outcome.


6.4 ESTIMATION OF THE VARIANCE AND COVARIANCE 


We make $n$ observations $X_1, X_2, \ldots, X_n$ on a Normal RV $X$ with mean $\mu_X$ and variance
$\sigma_X^2$. If $\mu_X$ is known then an unbiased VEF is computed from the random sample as

$$\hat{\sigma}_X^2(n) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_X)^2 \tag{6.4-1}$$

and it is not difficult to show that $\hat{\sigma}_X^2(n)$ is an unbiased, consistent estimator of $\sigma_X^2$. If the
mean is not known, then the VEF

$$\hat{\sigma}_X^2(n) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\mu}_X(n))^2 \tag{6.4-2}$$

is an unbiased and consistent estimator of $\sigma_X^2$.


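Unbiasedness of $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. Consider

$$\begin{aligned}
E\left[\sum_{i=1}^{n}\Big(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\Big)^2\right]
&= E\left[\sum_{i=1}^{n}\Big(X_i^2 - \frac{2}{n} X_i \sum_{j=1}^{n} X_j + \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} X_j X_k\Big)\right] \\
&= (n-1)\sigma^2.
\end{aligned} \tag{6.4-3}$$

In obtaining Equation 6.4-3, we used the fact that $E[X_i^2] = \sigma^2 + \mu^2$, $i = 1, \ldots, n$, and that
$E[X_j X_k] = \mu^2$ for $j \neq k$. Clearly if

$$E\left[\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\right] = (n-1)\sigma^2$$

then

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\right] = \sigma^2. \tag{6.4-4}$$

But the quantity inside the square brackets is $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. Hence $\hat{\sigma}_X^2(n)$ is
unbiased for $\sigma^2$.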


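Consistency of $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. The variance of $\hat{\sigma}_X^2(n)$ is given by

$$\begin{aligned}
\mathrm{Var}[\hat{\sigma}_X^2(n)] &= E[(\hat{\sigma}_X^2(n) - \sigma^2)^2] \\
&= E\left[\Big(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j \neq i}(X_i - \mu)(X_j - \mu) - \sigma^2\Big)^2\right].
\end{aligned}$$

A straightforward calculation shows that for $n \gg 1$

$$\mathrm{Var}[\hat{\sigma}_X^2(n)] \approx \frac{c_4}{n}, \tag{6.4-5}$$

where $c_4 \triangleq E[(X_1 - \mu)^4]$ (see Equation 4.3-2a). Assuming that $c_4$ (the fourth-order central
moment) exists, we once again use the Chebyshev inequality to write that

$$P[|\hat{\sigma}_X^2(n) - \sigma^2| > \varepsilon] \leq \frac{\mathrm{Var}[\hat{\sigma}_X^2(n)]}{\varepsilon^2} \approx \frac{c_4}{n\varepsilon^2} \xrightarrow{\,n \to \infty\,} 0. \tag{6.4-6}$$

Hence $\hat{\sigma}_X^2(n)$ is a consistent estimator for $\sigma^2$.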


Example 6.4-1 
(computing the numerical sample mean and numerical sample variance of a Normal random
variable) Ten observations are made on a Normal RV $X : N(3, 1/10)$. The realizations are:
3.12, 2.87, 3.04, 2.77, 2.89, 3.34, 3.51, 2.44, 3.28, and 2.95. To compute the numerical sample
mean and the numerical sample variance, we proceed as follows:

The numerical sample mean is computed as

$$\mu_s = \tfrac{1}{10}(3.12 + 2.87 + 3.04 + 2.77 + 2.89 + 3.34 + 3.51 + 2.44 + 3.28 + 2.95) = 3.02.$$

The numerical sample variance is computed as

$$\begin{aligned}
\sigma_s^2 &= \tfrac{1}{9}(0.01 + 0.0225 + 0.0004 + 0.0625 + 0.0169 + 0.1024 + 0.2401 + 0.3364 + 0.0676 + 0.0049) \\
&= 0.096.
\end{aligned}$$

In signal processing the ratio $(\mu_s/\sigma_s)^2$ is sometimes called the signal-to-noise (power) ratio;
in this case it is 95. It is commonly given in decibels (dB), which in this case is $10 \times \log_{10} 95 =
19.8$ dB.
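The arithmetic of Example 6.4-1 is easy to verify with the Python standard library. This sketch uses `statistics.variance`, which divides by $n - 1$ as in Equation 6.4-2; the variable names are ours.

```python
import math
import statistics

x = [3.12, 2.87, 3.04, 2.77, 2.89, 3.34, 3.51, 2.44, 3.28, 2.95]

mu_s = statistics.fmean(x)       # numerical sample mean
var_s = statistics.variance(x)   # numerical sample variance (divides by n - 1)
snr = mu_s ** 2 / var_s          # signal-to-noise (power) ratio
snr_db = 10.0 * math.log10(snr)  # the same ratio in decibels

print(round(mu_s, 2), round(var_s, 3), round(snr), round(snr_db, 1))
```

The realized sample variance, about 0.096, is close to the true variance 1/10 of the population from which the ten observations were drawn.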


Confidence Interval for the Variance of a Normal Random Variable


Determining a confidence interval for the variance involves the $\chi^2$ distribution. Suppose
we make $n$ i.i.d. observations on the Normal RV $X$ and label these observations as $X_1,
X_2, \ldots, X_n$. Then, for each $i$,

$$U_i \triangleq \frac{X_i - \mu_X}{\sigma_X} \tag{6.4-7}$$

is $N(0,1)$ and $Z_n \triangleq \sum_{i=1}^{n} U_i^2$ is Chi-square distributed with a DOF of $n$. The $\chi^2$ pdf is shown
in Figure 6.4-1 and is denoted by $f_{\chi^2}(x; n)$. If $\mu_X$ is not known in Equation 6.4-7 and we
replace it with $\hat{\mu}_X(n)$ from Equation 6.3-2 we create a new RV

$$V_i \triangleq \frac{X_i - \hat{\mu}_X(n)}{\sigma_X} \tag{6.4-8}$$

and the sum $Z_{n-1} \triangleq \sum_{i=1}^{n} V_i^2$ is also Chi-square but with $n - 1$ degrees of freedom.
[Figure 6.4-1: Chi-square pdf for n = 2 and n = 10; horizontal axis x from 0 to 25, vertical axis pdf value.]

Figure 6.4-1 The Chi-square pdf for n = 2 (curve with value 0.5 at the origin) and n = 10. For all
values of n > 2, the pdf will be zero at the origin.


Example 6.4-2 
(computing the degrees of freedom of a Chi-square RV) With the $V_i$ defined in Equation
6.4-8, the random variable $\sum_{i=1}^{2} V_i^2$ is Chi-square with a DOF of unity. We can see this with
the help of a little bit of algebra. We find that $V_1^2 + V_2^2 = \left(\frac{X_1 - X_2}{\sigma_X\sqrt{2}}\right)^2$. But
$U_1 \triangleq (X_1 - X_2)/(\sigma_X\sqrt{2})$ is $N(0,1)$ and hence in the sum $Z_n \triangleq \sum_{i=1}^{n} U_i^2$ there is only one
nonzero term, that is, $U_1^2 = Z_1$.
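To find a confidence interval on $\sigma_X^2$ at level, say, $\delta$ (e.g., $\delta = 0.95$, $\delta = 0.98$, $\delta = 0.99$),
we begin with

$$W_{n-1} \triangleq \sum_{i=1}^{n} V_i^2 = \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2$$

and seek numbers $a, b$ such that $P[a \leq W_{n-1} \leq b] = P\left[a \leq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq b\right] = \delta$.
The event $\{a \leq W_{n-1} \leq b\}$ is the same as
the event $\left\{\frac{1}{b}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq \sigma_X^2 \leq \frac{1}{a}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\right\}$. Hence the width of the
confidence interval for the variance is†

$$W_\delta(a,b) = \left(\frac{1}{a} - \frac{1}{b}\right)\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2. \tag{6.4-9}$$

Since $W_{n-1} = \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2$ is $\chi^2_{n-1}$, we solve for the numbers $a, b$ from $P[a \leq
W_{n-1} \leq b] = F_{\chi^2}(b; n-1) - F_{\chi^2}(a; n-1) = \delta$. To avoid the algebraic difficulties associated with
finding the shortest interval, we find numbers $a, b$ that give a near-shortest interval as follows:
The probability that $W_{n-1}$ lies outside the interval is $1 - \delta = 1 - F_{\chi^2}(b; n-1) + F_{\chi^2}(a; n-1)$;
if we denote $1 - \delta$ as the “error probability” and assign $1 - F_{\chi^2}(b; n-1) = (1 - \delta)/2$ and
$F_{\chi^2}(a; n-1) = (1 - \delta)/2$, then we have divided the overall “error probability” into equal
area-halves under the tails of the $\chi^2_{n-1}$ pdf. It then follows that $a = x_{(1-\delta)/2}$, that is,
$F_{\chi^2}(x_{(1-\delta)/2}; n-1) = (1 - \delta)/2$, and $b = x_{(1+\delta)/2}$, that is, $F_{\chi^2}(x_{(1+\delta)/2}; n-1) = (1 + \delta)/2$.
The numbers $x_{(1-\delta)/2}$ and $x_{(1+\delta)/2}$ are called, respectively, the $(1 - \delta)/2$ and $(1 + \delta)/2$
percentiles of the $\chi^2_{n-1}$ RV. The $\delta$-confidence interval for the variance is

$$\left[\frac{1}{x_{(1+\delta)/2}}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2,\; \frac{1}{x_{(1-\delta)/2}}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\right]$$

and its length $L$ is

$$L = \left(\frac{1}{x_{(1-\delta)/2}} - \frac{1}{x_{(1+\delta)/2}}\right)\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2.$$

†Please do not confuse the width symbol $W_\delta(a, b)$ with the $\chi^2$ random variable symbol $W_{n-1}$.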
Example 6.4-3 


Sixteen i.i.d. observations are made on $X : N(\mu_X, \sigma_X^2)$. A confidence interval on $\sigma_X^2$ is
required. Find the numbers $a, b$ that will give a near-shortest 95 percent confidence interval
on $\sigma_X^2$ using the “equal error probability” rule.


Solution $F_{\chi^2}(a; 15) = F_{\chi^2}(x_{0.025}; 15) = 0.025$ and $F_{\chi^2}(b; 15) = F_{\chi^2}(x_{0.975}; 15) = 0.975$. From the table of
the Chi-square distribution, we find $a = x_{0.025} = 6.26$ and $b = x_{0.975} = 27.5$.
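The Chi-square percentiles of Example 6.4-3 can be computed rather than read from a table. The sketch below is an assumption-laden illustration, not the method of the text: it integrates the standard Chi-square pdf with Simpson's rule and inverts the CDF by bisection, and it assumes at least 3 degrees of freedom so that the pdf vanishes at the origin. The function names are ours.

```python
import math

def chi2_pdf(x, dof):
    """Chi-square pdf with dof degrees of freedom (standard closed form)."""
    if x <= 0.0:
        return 0.0   # correct at the origin only for dof > 2
    k = dof / 2.0
    return math.exp((k - 1.0) * math.log(x) - x / 2.0
                    - k * math.log(2.0) - math.lgamma(k))

def chi2_cdf(x, dof, steps=2000):
    """Simpson's rule on [0, x]; assumes dof >= 3."""
    h = x / steps
    total = chi2_pdf(0.0, dof) + chi2_pdf(x, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, dof)
    return total * h / 3.0

def chi2_percentile(u, dof):
    """Solve F(x; dof) = u for 0 < u < 1 by bisection."""
    lo, hi = 0.0, float(dof)
    while chi2_cdf(hi, dof) < u:   # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example 6.4-3: sixteen observations, 15 degrees of freedom, delta = 0.95.
a = chi2_percentile(0.025, 15)
b = chi2_percentile(0.975, 15)
print(round(a, 2), round(b, 1))
```

Given a realized sum of squared deviations $S$, the corresponding near-shortest interval on the variance would then be $[S/b, S/a]$.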


Estimating the Standard Deviation Directly 


We can estimate the standard deviation $\sigma_X$ from

$$\hat{\sigma}_X(n) = \left(\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\right)^{1/2} \tag{6.4-10}$$

but this involves computing $\hat{\sigma}_X^2(n)$ first. Another approach estimates $\sigma_X$ directly. Consider
two i.i.d. observations $X_1, X_2$ on the generic RV $X$. Let $Z \triangleq \max(X_1, X_2)$, $\hat{\mu}_X \triangleq (X_1 + X_2)/2$.
The pdf of $Z$ is readily computed as $f_Z(z) = 2F_X(z)f_X(z)$, where $F_X(z)$ and $f_X(z)$ are,
respectively, the CDF and pdf of $X$. Now consider the estimator $\hat{\sigma}_X$

$$\hat{\sigma}_X \triangleq \sqrt{\pi}(Z - \hat{\mu}_X) \tag{6.4-11}$$

and compute $E[\hat{\sigma}_X] = \sqrt{\pi}E[Z - \hat{\mu}_X] = \sqrt{\pi}(E[Z] - \mu_X)$. The computation of $E[Z]$ when
$X$ is Normal can be done with the aid of standard tables of integrals (see Handbook of
Mathematical Functions, M. Abramowitz and I. A. Stegun, eds., Dover, New York, 1970,
p. 303, formula 7.4.14), or with Maple, MathCAD, Mathematica, etc. We find that $E[Z] =
\mu_X + \sigma_X/\sqrt{\pi}$ so that $E[\hat{\sigma}_X] = \sigma_X$. Hence $\hat{\sigma}_X \triangleq \sqrt{\pi}(Z - \hat{\mu}_X)$ is an unbiased estimator
for $\sigma_X$.
Example 6.4-4 
(one-shot estimation of $\sigma_X$) Two realizations of $X : N(\mu_X, \sigma_X^2)$ are obtained as 3.8, 4.1.
Then (primes indicate realizations) $Z' = \max(3.8, 4.1) = 4.1$, $\hat{\mu}'_X = (3.8 + 4.1)/2 = 3.95$,
and $\hat{\sigma}'_X = \sqrt{\pi}(4.1 - 3.95) \approx 0.27$. Computing $\hat{\sigma}'_X$ from Equation 6.4-10 yields 0.21.


To compute the variance of the standard deviation estimator function (SDEF) in
Equation 6.4-11 we write:

$$\mathrm{Var}(\hat{\sigma}_X) = \pi\left(E[Z^2] + E[\hat{\mu}_X^2] - 2E[Z\hat{\mu}_X]\right) - \sigma_X^2.$$

This computation takes some work but the result is

$$\mathrm{Var}(\hat{\sigma}_X) = \left(\frac{\pi}{2} - 1\right)\sigma_X^2 \approx 0.57\sigma_X^2. \tag{6.4-12}$$


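In practice we would not want to estimate $\sigma_X$ from only two observations on $X$. Suppose
we make $n$ (even) observations on $X$, which we denote as $X_1, X_2, \ldots, X_n$, and pair them as
$\{X_1, X_2\}, \ldots, \{X_{n-1}, X_n\}$. Let

$$\begin{aligned}
\hat{\sigma}^{(1)} &\triangleq \sqrt{\pi}\,(\max(X_1, X_2) - 0.5(X_1 + X_2)) \\
\hat{\sigma}^{(2)} &\triangleq \sqrt{\pi}\,(\max(X_3, X_4) - 0.5(X_3 + X_4)) \\
&\;\;\vdots \\
\hat{\sigma}^{(n/2)} &\triangleq \sqrt{\pi}\,(\max(X_{n-1}, X_n) - 0.5(X_{n-1} + X_n))
\end{aligned}$$

and define

$$\hat{\sigma}_{\text{ave}} \triangleq \frac{1}{n/2}\sum_{i=1}^{n/2}\hat{\sigma}^{(i)}. \tag{6.4-13}$$

Then $\mathrm{Var}(\hat{\sigma}_{\text{ave}}) = \frac{1}{(n/2)^2}\sum_{i=1}^{n/2}\mathrm{Var}(\hat{\sigma}^{(i)})$, which, with Equation 6.4-12, gives

$$\mathrm{Var}(\hat{\sigma}_{\text{ave}}) \approx \frac{1.14}{n}\sigma_X^2. \tag{6.4-14}$$

It is straightforward to show that $\hat{\sigma}_{\text{ave}}$ is a consistent estimator for $\sigma_X$; we leave this
as an exercise for the reader. A confidence interval for $\sigma_X$ based on estimating $\sigma_X$ with
$\hat{\sigma}_X \triangleq \sqrt{\pi}(Z - \hat{\mu}_X)$ is discussed in [6-3] and [6-4].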
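The paired estimator of Equations 6.4-11 and 6.4-13 lends itself to a quick numerical check. The following sketch first reproduces the one-shot estimate from the two realizations 3.8 and 4.1, and then applies the averaged estimator to simulated Normal data. The helper name `sigma_pairs`, the seed, and the simulation parameters (mean 5, standard deviation 2, 20,000 observations) are our own assumptions chosen only for illustration.

```python
import math
import random

def sigma_pairs(samples):
    """Average of sqrt(pi) * (max - midpoint) over consecutive pairs."""
    pairs = zip(samples[0::2], samples[1::2])
    ests = [math.sqrt(math.pi) * (max(a, b) - 0.5 * (a + b)) for a, b in pairs]
    return sum(ests) / len(ests)

# One-shot estimate from a single pair of realizations.
one_shot = sigma_pairs([3.8, 4.1])

# Simulated check of unbiasedness: true sigma is 2.0 in this hypothetical setup.
random.seed(1)
data = [random.gauss(5.0, 2.0) for _ in range(20000)]
sigma_ave = sigma_pairs(data)

print(round(one_shot, 3), round(sigma_ave, 2))
```

The one-shot value is about 0.266, and with many pairs the averaged estimate should land near the true standard deviation, in line with the unbiasedness and the $1/n$ variance decay discussed above.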

Estimating the Covariance


The covariance, defined by

$$c_{11} \triangleq \mathrm{Cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)], \tag{6.4-15}$$

is classically estimated from the covariance estimating function (CEF)

$$\hat{c}_{11} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))(Y_i - \hat{\mu}_Y(n)), \tag{6.4-16}$$

where $\{X_i, Y_i, i = 1, \ldots, n\}$ are $n$ paired i.i.d. observations. We leave it to the reader to
show that $\hat{c}_{11}$ is an unbiased and consistent estimator for $c_{11}$. The normalized covariance,
also called the correlation coefficient, is defined as

$$\rho_{XY} \triangleq \frac{c_{11}}{\sqrt{\sigma_X^2\sigma_Y^2}}. \tag{6.4-17}$$

It is estimated from

$$\hat{\rho}_{XY} \triangleq \frac{\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))(Y_i - \hat{\mu}_Y(n))}{\sqrt{\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2\,\sum_{i=1}^{n}(Y_i - \hat{\mu}_Y(n))^2}}. \tag{6.4-18}$$

The distribution of $\hat{\rho}_{XY}$ is not available in closed form. However, a confidence interval for
$\rho_{XY}$ can be found using more advanced methods [6-1].
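As a small illustration of Equations 6.4-16 and 6.4-18, the sketch below computes the covariance estimate and the correlation-coefficient estimate for five paired observations. The data are hypothetical and the function name `cov_and_corr` is ours.

```python
import math

def cov_and_corr(xs, ys):
    """Return (c11_hat, rho_hat) from paired observations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    # The 1/(n-1) factors cancel in the ratio that defines rho_hat.
    return sxy / (n - 1), sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical observations on X
ys = [2.0, 4.0, 5.0, 4.0, 5.0]   # hypothetical paired observations on Y
c11_hat, rho_hat = cov_and_corr(xs, ys)
print(c11_hat, round(rho_hat, 4))
```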


6.5 SIMULTANEOUS ESTIMATION OF MEAN AND VARIANCE 


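If we seek, say, a 95 percent confidence region on both $\mu_X$ and $\sigma_X^2$ we take advantage of
the RVs $\hat{\mu}_X(n)$ and $\hat{\sigma}_X^2(n)$ being independent. Thus, we may write

$$P\left[-a \leq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \leq a,\; b \leq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq c\right] = 0.95 \tag{6.5-1}$$

or, equivalently,

$$P\left[-a \leq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \leq a\right] \times P\left[b \leq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq c\right] = 0.95. \tag{6.5-2}$$

Equation 6.5-2 follows from Equation 6.5-1 because of the independence of the events

$$E_1 \triangleq \left\{-a \leq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \leq a\right\} \quad \text{and} \quad E_2 \triangleq \left\{b \leq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq c\right\}.$$

We note that

$$Z \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}}$$

is the standard Normal RV $N(0,1)$ with distribution function $F_{SN}(z)$ while

$$W_{n-1} \triangleq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2$$

is $\chi^2_{n-1}$ with distribution function $F_{\chi^2}(x; n-1)$.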
The next step is to associate a probability to each of the events $E_1$, $E_2$. As an example
we could factor the joint $\delta$-confidence as $\delta = \sqrt{\delta} \times \sqrt{\delta}$; this would give for $\delta = 0.95$

$$P\left[-a \leq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \leq a\right] = \sqrt{0.95} \approx 0.975 \tag{6.5-3}$$

and

$$P\left[b \leq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \leq c\right] = \sqrt{0.95} \approx 0.975. \tag{6.5-4}$$

From Equation 6.5-3 we recognize that $a = z_{0.9875}$, that is, $F_{SN}(z_{0.9875}) = 0.9875$,
the 98.75 percentile of the standard Normal RV. From Equation 6.5-4, we determine, using
the “equal-error” assignment rule to the tails of the Chi-square pdf, that $b = x_{0.0125}$ and
$c = x_{0.9875}$, that is, the 1.25 and 98.75 percentiles of the cumulative Chi-square distribution
$F_{\chi^2}(x; n-1)$. More generally, for any given $\delta$-confidence interval and any given $n$, we can
find numbers $a$, $b$, and $c$ to satisfy the confidence constraints. Once this is done we can find
in the $\mu, \sigma^2$ parameter space the boundaries of the $\delta$-confidence region for $\mu_X, \sigma_X^2$. Event
$E_1$ is the convex region inside the parabola described by $\sigma^2 = n(\mu - \hat{\mu}_X)^2/a^2$. Event $E_2$
is the region between the end points

$$\begin{aligned}
\sigma^2_{\max} &= \frac{1}{b}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \quad \text{(upper bound)}, \\
\sigma^2_{\min} &= \frac{1}{c}\sum_{i=1}^{n}(X_i - \hat{\mu}_X(n))^2 \quad \text{(lower bound)}.
\end{aligned} \tag{6.5-5}$$

The event $E_1 \cap E_2$ is then the shaded region shown in Figure 6.5-1.
In approximately 950 of 1000 cases, the region shown in Figure 6.5-1 will cover the
point $\mu_X, \sigma_X^2$, that is, the true values of the unknown mean and variance.

[Figure 6.5-1: the parabola $\sigma^2 = n(\mu - \hat{\mu}_X)^2/a^2$ in the $\mu, \sigma^2$ plane, with the shaded confidence region.]

Figure 6.5-1 The confidence region for the combined estimation of $\mu$ and $\sigma^2$.
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Example 6.5-1 


(confidence region for mean and variance) We make 21 observations $\{X_i, i = 1, \ldots, 21\}$ on
a Normal population $X: N(\mu_X, \sigma_X^2)$. A 90 percent confidence region is desired for the pair
$\mu_X, \sigma_X^2$.

To achieve a 90 percent confidence region, we assign (approximately) a 0.95 probability
that the N(0,1) RV $Z$ lies in the interval $(-a, a)$ and a 0.95 probability that the Chi-square
RV $W$ with DOF of 20 lies in the interval $(b, c)$; the product $0.95 \times 0.95 \approx 0.90$ is the joint
confidence. From Equation 6.5-3 we obtain
$P[-z_{0.975} < Z < z_{0.975}] = 0.95$; hence, from the standard Normal distribution table, we
find $F_{SN}(z_{0.975}) = 0.975$ or $z_{0.975} = 1.96$. From Equation 6.5-4 we obtain $P[b < W < c] = 0.95$,
from which we determine numbers $b = x_{0.025}$, $c = x_{0.975}$ using the “equal-error”
assignment of Example 6.3-3. Thus, $F_{\chi^2}(x_{0.025}; 20) = 0.025$ and $F_{\chi^2}(x_{0.975}; 20) = 0.975$ so
that $x_{0.025} = 9.59$ and $x_{0.975} = 34.2$. The numbers $x_{0.025}$ and $x_{0.975}$ are the 2.5 and 97.5
percentiles, respectively, of the $\chi^2$ RV.
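
The region test in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `in_confidence_region` and the 21 sample values are hypothetical, and the constants $a = 1.96$, $b = 9.59$, $c = 34.2$ are the table values quoted above.

```python
import math

def in_confidence_region(mu, sigma2, sample, a, b, c):
    """Return True if the point (mu, sigma2) lies in the joint region E1 and E2."""
    n = len(sample)
    mu_hat = sum(sample) / n
    s = sum((x - mu_hat) ** 2 for x in sample)  # sum of squared deviations
    if sigma2 <= 0:
        return False
    # E1: -a < (mu_hat - mu) / (sigma / sqrt(n)) < a  (inside the parabola)
    in_e1 = abs(mu_hat - mu) < a * math.sqrt(sigma2 / n)
    # E2: b < s / sigma2 < c, equivalently s/c < sigma2 < s/b
    in_e2 = s / c < sigma2 < s / b
    return in_e1 and in_e2

# Hypothetical data: 21 made-up observations (not from the text).
sample = [float(k % 7) for k in range(21)]
A, B, C = 1.96, 9.59, 34.2  # table values from Example 6.5-1

print(in_confidence_region(3.0, 4.0, sample, A, B, C))
print(in_confidence_region(5.0, 4.0, sample, A, B, C))
```

Scanning a grid of candidate points with this test traces out a region of the kind shaded in Figure 6.5-1.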


6.6 ESTIMATION OF NON-GAUSSIAN PARAMETERS FROM LARGE SAMPLES 


Consider an RV $X$ with mean $\mu$ and finite variance $\sigma^2$. We make $n$ i.i.d. observations
$\{X_i, i = 1, \ldots, n\}$ on $X$ and deduce from the Central Limit Theorem that the sample mean
estimator* (SME)

$$\hat{\mu}(n) \triangleq \frac{1}{n}\sum_{i=1}^{n} X_i$$

is approximately Normal as $N(\mu, \sigma^2/n)$ for large $n$. If $X$ is a continuous RV then the SME
is approximately Normal in density, else it is approximately Normal in distribution. When
the parameters to be estimated are associated with non-Gaussian distributions, it may still
be possible to estimate them using Equation 6.6-1 as a starting point:

$$P\left[-a < \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} < a\right] = \delta, \tag{6.6-1}$$

which can be rewritten as

$$P\left[(-a\sigma/\sqrt{n}) + \hat{\mu} < \mu < (a\sigma/\sqrt{n}) + \hat{\mu}\right] = \delta. \tag{6.6-2}$$

The reader will recognize that this is the expression for a $100 \times \delta$ percent confidence
interval for $\mu$. When distributions are non-Gaussian, the mean and variance may be related
parameters, that is, $\sigma = \sigma(\mu)$. How do we handle such cases? We illustrate with two examples
from [6-2].


Example 6.6-1 
(confidence interval for $\lambda$ in the exponential distribution) Suppose we want to estimate $\lambda$
in the exponential pdf $f_X(x) = \lambda e^{-\lambda x}u(x)$. For this law we find


*Recall we use the mean-estimator function and the sample mean estimator interchangeably.


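$$\mu = E[X] = \int_0^\infty x\lambda e^{-\lambda x}\,dx = \lambda^{-1}$$

and

$$\sigma^2 = E[(X - \mu)^2] = \int_0^\infty (x - \lambda^{-1})^2 \lambda e^{-\lambda x}\,dx = \lambda^{-2}.$$

Inserting these results into Equation 6.6-2 and rearranging terms to expose $\lambda$ yields

$$P\left[\frac{(-a/\sqrt{n}) + 1}{\hat{\mu}} < \lambda < \frac{(a/\sqrt{n}) + 1}{\hat{\mu}}\right] = \delta. \tag{6.6-3}$$

The number $a$ is obtained from approximating $Z = (\hat{\mu} - \mu)\sqrt{n}/\sigma$ as a N(0,1) random
variable. This yields $a = z_{(1+\delta)/2}$ where $F_{SN}(z_{(1+\delta)/2}) = (1+\delta)/2$. Thus a $100 \times \delta$ percent
confidence interval for $\lambda$ has width

$$W_\delta = \frac{2 z_{(1+\delta)/2}}{\hat{\mu}\sqrt{n}}. \tag{6.6-4}$$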


Example 6.6-2 
(numerical evaluation of confidence interval for $\lambda$) It is desired to obtain a 95 percent
confidence interval on the parameter $\lambda$ of the exponential distribution from 64 i.i.d. observations
on an exponential RV $X$. The estimate is $\hat{\mu} = 3.5$. From Equation 6.6-1 we obtain
$2 \times \operatorname{erf}(a) = 0.95$ or, equivalently, $F_{SN}(z_{(1+\delta)/2}) = (1+\delta)/2 = 0.975$. This gives $z_{0.975} = 1.96$.
Then from Equations 6.6-3 and 6.6-4 we compute that the 95 percent confidence interval
for $\lambda$ is {0.22, 0.36} and has an approximate width of 0.14.
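
The arithmetic of Equations 6.6-3 and 6.6-4 can be checked with a short Python sketch. The function name `exponential_rate_interval` is an illustrative choice, and the Normal percentile is taken from the standard library's `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def exponential_rate_interval(mu_hat, n, delta):
    """Large-sample confidence interval for lambda per Equations 6.6-3 and 6.6-4."""
    a = NormalDist().inv_cdf((1 + delta) / 2)   # a = z_{(1+delta)/2}
    lower = (1 - a / sqrt(n)) / mu_hat
    upper = (1 + a / sqrt(n)) / mu_hat
    return lower, upper, upper - lower          # width equals 2a / (mu_hat * sqrt(n))

lower, upper, width = exponential_rate_interval(3.5, 64, 0.95)
print(round(lower, 2), round(upper, 2), round(width, 2))
```

The interval is valid only when $a < \sqrt{n}$, which the large-sample assumption normally guarantees.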


Example 6.6-3 
(confidence interval for $p$ in the Bernoulli distribution) Given a Bernoulli RV $X$, with
probability $P[X = 1] = p$ and $P[X = 0] = q = 1 - p$, we want to estimate $p$ at a
$100 \times \delta$ percent level of confidence from $n$ (sufficiently large) i.i.d. observations on $X$. For
this distribution $\mu_X \triangleq E[X] = p$ and the MEF is $\hat{p} = (1/n)\sum_{i=1}^{n} X_i$. As demonstrated in
earlier chapters $E[\hat{p}] = p$ and $\operatorname{Var}[\hat{p}] = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}[X_i] = \frac{1}{n^2}\,npq = pq/n$.

Hence the RV

$$Z \triangleq \frac{\hat{p} - p}{\sqrt{pq/n}}$$

for large $n$ is Normal in distribution (not in density since $X$ is a discrete RV) as N(0,1).
To obtain a $100 \times \delta$ percent confidence interval on $p$ we write

$$P[-a < Z < a] = \delta, \tag{6.6-5}$$

that is,

$$P\left[-a < \frac{\hat{p} - p}{\sqrt{pq/n}} < a\right] = \delta \tag{6.6-6}$$

or, equivalently,

$$P[(\hat{p} - p)^2 < a^2 pq/n] = \delta.$$

As usual we find the constant $a$ from $2\operatorname{erf}(a) = \delta$, that is†, $a = z_{(1+\delta)/2}$, and find the end points
of the confidence interval by solving for the roots of $(p - \hat{p})^2 - a^2 pq/n = 0$ (where $q = 1 - p$).
These are


†Recall that $2 \times \operatorname{erf}(a) = 2 \times F_{SN}(a) - 1 = \delta$ so that $a = z_{(1+\delta)/2}$, i.e., the $(1+\delta)/2$ percentile of the
standard Normal RV.

Figure 6.6-1 The width of the confidence interval decreases slowly with an increase in the number of
samples in Example 6.6-3. Here we assumed that $4pq \approx 1$. (Plot titled “95 percent interval on the Bernoulli probability p”; vertical axis: interval width.)


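$$p_1 = \frac{2\hat{p} + (a^2/n)}{2(1 + (a^2/n))} - \frac{1}{2(1 + (a^2/n))}\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]},$$
$$p_2 = \frac{2\hat{p} + (a^2/n)}{2(1 + (a^2/n))} + \frac{1}{2(1 + (a^2/n))}\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]},$$

where $\hat{q} = 1 - \hat{p}$, giving an interval width

$$W_\delta = |p_2 - p_1| = \frac{1}{1 + (a^2/n)}\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]}. \tag{6.6-7}$$

The width of the interval decreases slowly with sample size (Figure 6.6-1).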


Example 6.6-4 
(how fair is the “fair” coin) We wish to obtain information about the “fairness” of a coin.
For this purpose the coin is tossed 100 times and 47 heads are observed. A 95 percent confidence
interval on $p$, the probability of a head, is desired. Using the MEF we find that $\hat{p} = 0.47$. We
find $a$ from $2 \times \operatorname{erf}(a) = 0.95$, or $a = 1.96$, and from Equation 6.6-7, $W_\delta \approx 0.192$. The interval
is centered near 0.47 and extends from about 0.37 to 0.57. The interval includes the “fair” coin value
of $p = 0.5$ and we have no basis for believing that the coin is biased. If the number of i.i.d.
observations increases to 1200, and we observe 564 heads, then $\hat{p}$ still has value $\hat{p} = 0.47$
but the 95 percent interval is {0.442, 0.498} and does not include the “fair” value of 0.5.
This suggests that the coin has a slight bias in the direction of getting more tails.
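
The two intervals in this example can be reproduced from the roots that lead to Equation 6.6-7. The following Python sketch is illustrative only; the function name `bernoulli_interval` is an assumption, and the Normal percentile comes from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def bernoulli_interval(successes, n, delta=0.95):
    """End points p1, p2 of the large-sample interval for the Bernoulli p."""
    p_hat = successes / n
    a = NormalDist().inv_cdf((1 + delta) / 2)    # a = z_{(1+delta)/2}
    c = a * a / n                                # shorthand for a^2 / n
    center = (2 * p_hat + c) / (2 * (1 + c))
    half_width = sqrt(c * (c + 4 * p_hat * (1 - p_hat))) / (2 * (1 + c))
    return center - half_width, center + half_width

p1, p2 = bernoulli_interval(47, 100)      # 47 heads in 100 tosses
q1, q2 = bernoulli_interval(564, 1200)    # 564 heads in 1200 tosses
print(round(p1, 3), round(p2, 3), round(p2 - p1, 3))
print(round(q1, 3), round(q2, 3))
```

With 1200 tosses the upper end point falls just below 0.5, so the exclusion of the fair value is a narrow one.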


6.7 MAXIMUM LIKELIHOOD ESTIMATORS 


In the previous sections we furnished estimators for the mean, variance, and covariance of 
RVs. While these estimators enjoyed desirable properties, they seemed quite arbitrary in 
that they did not follow from any general principle. In this section, we discuss a somewhat 
general approach for finding estimators. This approach is called the maximum likelihood
(ML) principle and the estimators derived from it are called maximum likelihood estimators
(MLEs). The main drawback to the MLE approach is that the underlying form of the pdf 
of the observed data must be known. The idea behind the MLE approach is illustrated in 
the following example. 


Example 6.7-1 
Consider a Bernoulli RV that has PMF $P_X(k) = p^k(1 - p)^{1-k}$, where $P[X = 1] = p$ and
$P[X = 0] = 1 - p$. We would like to estimate the value of $p$ with an estimator, say, $\hat{p}$, that is
a function only of the observations on $X$. Suppose we make $n$ observations on $X$ and we call
these observations $X_1, X_2, \ldots, X_n$. Then $Y \triangleq \sum_{i=1}^{n} X_i$ is the number of times that a one
was observed in $n$ tries. For example, the experiment might consist of tossing a coin $n$ times
and counting the number of times it came up heads, that is, $\{X = 1\}$, when the probability
of a head is $p$. Suppose this number is $k_1$. The a priori probability of observing $k_1$ heads
is given by

$$P[Y = k_1; p] = \binom{n}{k_1} p^{k_1}(1 - p)^{n - k_1}.$$

We explicitly show the dependence of the result on $p$ because $p$ is assumed unknown. We now
ask what value of $p$ was most likely to have yielded this result? Since the term on the right is
a continuous function of $p$, we can obtain this result by a differentiation. Setting the derivative
to zero yields

$$\frac{dP[Y = k_1; p]}{dp} = \binom{n}{k_1} p^{k_1 - 1}(1 - p)^{n - k_1 - 1}\left[k_1(1 - p) - p(n - k_1)\right] = 0.$$

Thus, there are three roots: $p = 0$, $p = 1$, and $p = k_1/n$. The first two roots yield a minimum
while $p = k_1/n$ yields a maximum. Thus, our estimate for the most likely value of $p$ in this
case is $k_1/n$. Had we performed the experiment a second time and observed $k_2$ heads, our
estimate for $p$ would have been $k_2/n$. These estimates are realizations of the MLE for $p$:

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{6.7-1}$$


In the previous example we used the fact that the distribution of $\sum_{i=1}^{n} X_i$ is binomial. Could
we have obtained the same result without this knowledge? After all, for some distributions 
it might be quite a bit of work to compute the distribution of the sum of RVs. The answer 
is yes and the result is based on generation of the likelihood function. 


Definition 6.7-1 The likelihood function† $L(\theta)$ of the random variables $X_1,
X_2, \ldots, X_n$ is the joint pdf $f_{X_1 X_2 \cdots X_n}(x_1, x_2, \ldots, x_n; \theta)$ considered as a function of the
unknown parameter $\theta$. In particular if $X_1, X_2, \ldots, X_n$ are independent observations on a RV
$X$ with pdf $f_X(x; \theta)$, then the likelihood function for outcomes $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$
becomes

$$L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta) \tag{6.7-2}$$

since the $\{X_i\}$ are i.i.d. RVs with pdf $f_X(x; \theta)$. If, for a given outcome $\mathbf{x} = (x_1, x_2, \ldots, x_n)$,
$\theta^*(x_1, x_2, \ldots, x_n)$ is the value of $\theta$ that maximizes $L(\theta)$, then $\theta^*(x_1, x_2, \ldots, x_n)$ is the ML
estimate of $\theta$ (a number) and $\hat{\theta} = \theta^*(X_1, X_2, \ldots, X_n)$ is the MLE (an RV) for $\theta$. It is therefore
quite reasonable to define the likelihood function as the RV $L(\theta) \triangleq \prod_{i=1}^{n} f_X(X_i; \theta)$.
Then, maximizing with respect to $\theta$ yields the MLE $\hat{\theta}(X_1, \ldots, X_n)$ directly.

†Strictly speaking we should write $L(\theta; x_1, x_2, \ldots, x_n)$ or, as some books have, $L(\theta; X_1, X_2, \ldots, X_n)$.
However, we dispense with this excessive notation.


Example 6.7-2 
We consider finding the ML estimate of $p$ in Example 6.7-1 using the likelihood function.
If we make $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on a Bernoulli RV $X$, the likelihood function
(here $\theta = p$) becomes $L(\theta) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum_{i=1}^{n} x_i} \times (1 - p)^{n - \sum_{i=1}^{n} x_i}$. By setting $dL(\theta)/d\theta = 0$,
we obtain three roots: $p = 0$, $p = 1$, and $p = \sum_{i=1}^{n} x_i/n$. The first two roots yield a
minimum, while the last root yields a maximum. Thus, $p^*(\mathbf{x}) = \sum_{i=1}^{n} x_i/n$ and the MLE
of $p$ is $\hat{p} = p^*(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i/n$.
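
As a numerical sanity check on this result, one can maximize the Bernoulli log-likelihood over a grid and compare with the closed form $\sum x_i/n$. The sketch below is illustrative: the ten outcomes and the grid resolution are hypothetical choices, and a grid search stands in for the differentiation used in the text.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """log L(p) = (sum x_i) log p + (n - sum x_i) log(1 - p), for 0 < p < 1."""
    k = sum(xs)
    n = len(xs)
    return k * math.log(p) + (n - k) * math.log(1 - p)

xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]             # hypothetical outcomes, 6 ones in 10
grid = [i / 1000 for i in range(1, 1000)]       # candidate values of p in (0, 1)
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))
p_closed = sum(xs) / len(xs)                    # realization of the MLE

print(p_grid, p_closed)
```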


In many cases the differentiation is more conveniently done on the logarithm of the likelihood
function. The log-likelihood function is $\log L(\theta)$ (usually the natural log is used) and has
its maximum at the same value of $\theta$ as that of $L(\theta)$. Another point is that the MLE
cannot always be found by differentiation, in which case we have to use other methods.
Finally, multiple-parameter ML estimation can be done by solving simultaneous equations.
We illustrate all three points in the next three examples, respectively.


Example 6.7-3 
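Assume $X: N(\mu, \sigma^2)$, where $\sigma$ is known. Compute the MLE of the mean $\mu$.


Solution The likelihood function for $n$ realizations of $X$ is

$$L(\mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right). \tag{6.7-3}$$

Since the log function is monotonic, the maximum of $L(\mu)$ is also that of $\log L(\mu)$. Hence
we compute

$$\log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

and set

$$\frac{\partial \log L(\mu)}{\partial \mu} = 0.$$

This yields

$$\sum_{i=1}^{n}(x_i - \mu) = 0,$$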




which implies that the MLE of $\mu$ should be

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{6.7-4}$$

Thus, we see that in the Normal case, the MLE of $\mu$ can be computed by differentiating
the log-likelihood function and that it turns out to be the sample mean.


Example 6.7-4 
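Assume $X$ is uniform in $(0, \theta)$, that is,

$$f_X(x) = \begin{cases} \dfrac{1}{\theta}, & 0 < x \le \theta, \\ 0, & x > \theta, \end{cases}$$

and we wish to compute the MLE for $\theta$. Let a particular realization of the $n$ observations
$X_1, \ldots, X_n$ be $\mathbf{x} = (x_1, \ldots, x_n)^T$ and let $x_m \triangleq \max(x_1, \ldots, x_n)$. The likelihood function is

$$L(\theta) = \begin{cases} \dfrac{1}{\theta^n}, & x_m \le \theta, \\ 0, & \text{otherwise}. \end{cases}$$

Clearly to maximize $L$ we must make the estimate $\theta'$ as small as possible. But $\theta'$ cannot be
smaller than $x_m$. Hence $\theta'$ is $x_m$ and the MLE is

$$\hat{\theta} = \max(X_1, \ldots, X_n). \tag{6.7-5}$$

The CDF of $\hat{\theta}$ for $n = 2$ is

$$F_{\hat{\theta}}(\alpha) = F_{X_1}(\alpha)F_{X_2}(\alpha) = F_X^2(\alpha). \tag{6.7-6}$$

We leave the computation of the CDF and pdf of $\hat{\theta}$ for arbitrary $n$ as an exercise for the
reader.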
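
Because this likelihood is discontinuous at $\theta = x_m$, differentiation does not locate the maximum, but direct evaluation does. The following Python sketch evaluates $L(\theta)$ on a few candidate values around the sample maximum; the four realizations and the candidate spacing are hypothetical.

```python
def uniform_likelihood(theta, xs):
    """L(theta) = theta**(-n) if max(xs) <= theta, and 0 otherwise."""
    if theta <= 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))

xs = [0.8, 2.1, 1.4, 0.3]          # hypothetical realizations x_1, ..., x_n
theta_ml = max(xs)                 # ML estimate: the largest observation

# Candidates below the maximum have zero likelihood; those above have less than L(x_m).
candidates = [theta_ml + 0.1 * k for k in range(-5, 6)]
best = max(candidates, key=lambda t: uniform_likelihood(t, xs))

print(theta_ml, best)
```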


Example 6.7-5 
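Consider the Normal pdf

$$f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right), \quad -\infty < x < \infty.$$

The log-likelihood function, for $n$ realizations, is

$$\tilde{L}(\mu, \sigma) \triangleq \log L = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2. \tag{6.7-7}$$

Now set

$$\frac{\partial \tilde{L}}{\partial \mu} = 0, \qquad \frac{\partial \tilde{L}}{\partial \sigma} = 0$$

and obtain the simultaneous equations

$$\sum_{i=1}^{n}(x_i - \mu) = 0 \tag{6.7-8}$$

$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0. \tag{6.7-9}$$

From Equation 6.7-8 we infer that

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{6.7-10}$$

and, using this result in Equation 6.7-9, that

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2. \tag{6.7-11}$$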
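
Equations 6.7-10 and 6.7-11 are easy to evaluate directly. The Python sketch below does so for a small hypothetical data set and compares the ML variance estimate, which divides by $n$, with the estimate that divides by $n - 1$; the standard library's `statistics.pvariance` is used as a cross-check on the $1/n$ form.

```python
from statistics import fmean, pvariance

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical realizations
n = len(xs)

mu_ml = fmean(xs)                                  # Equation 6.7-10
sum_sq = sum((x - mu_ml) ** 2 for x in xs)
var_ml = sum_sq / n                                # Equation 6.7-11 (divides by n)
var_unbiased = sum_sq / (n - 1)                    # divides by n - 1 instead

print(mu_ml, var_ml, var_unbiased)
print(abs(var_ml - pvariance(xs)) < 1e-12)         # pvariance also divides by n
```

The ML estimate of the variance is always the smaller of the two, which is one way to see that it is biased.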


MLEs have a number of desirable properties including squared-error consistency and invariance.
Invariance is the property that says that if $\hat{\theta}$ is the MLE for $\theta$, then $h(\hat{\theta})$ is the MLE
for $h(\theta)$. However, as seen in Example 6.7-5 (Equation 6.7-11), ML estimators cannot be
counted on to be unbiased. We complete this section with an example that illustrates the
invariance property.


Example 6.7-6 
Consider $n$ observations on a Normal RV. Assume that it is known that the mean is zero.
The MLE of the variance is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$. The standard deviation $\sigma$ is the square root
of the variance. Hence the MLE of the standard deviation is the square root of the MLE
for the variance, that is, $\hat{\sigma} = \left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right)^{1/2}$.


6.8 ORDERING, MORE ON PERCENTILES, PARAMETRIC VERSUS 
NONPARAMETRIC STATISTICS 


We make $n$ i.i.d. observations on a generic RV $X$ (recall that $X$ is sometimes called a
population) with CDF $F_X(x)$ to obtain the sample $X_1, X_2, \ldots, X_n$. The joint pdf of the
sample is $f_X(x_1) \times \cdots \times f_X(x_n)$, $-\infty < x_i < \infty$, $i = 1, \ldots, n$. Next we order the $X_i$, $i =
1, \ldots, n$, by size (signed magnitude) to obtain the ordered sample $Y_1, Y_2, \ldots, Y_n$ such that
$-\infty < Y_1 < Y_2 < \cdots < Y_n < \infty$. When ordered, the sequence 3, −2, −9, 4 would become
−9, −2, 3, 4. If a sequence $X_1, \ldots, X_{20}$ was generated from $n = 20$ observations on $X: N(0,1)$,
it would be very unlikely that $Y_1 > 0$ because this would require that the other 19 $Y_i$, $i =
2, \ldots, 20$, be greater than zero and therefore all the samples would be on the positive side of
the Normal curve. The probability of this event is $(1/2)^{20}$. Likewise it would be extremely
unlikely that $Y_{20} < 0$ because this would require that the other 19 $Y_i$, $i = 1, \ldots, 19$, be less
than zero. As shown in Section 5.3, the joint pdf of the ordered sample $Y_1, Y_2, \ldots, Y_n$ is
$n!\, f_X(y_1) \times \cdots \times f_X(y_n)$, $-\infty < y_1 < y_2 < \cdots < y_n < \infty$, and zero else. We distinguish
between ordering and ranking in that ranking normally assigns a value to the ordered 
elements. For example, most people would order the pain of a broken bone higher than that 
of a sore throat due to a cold. But if a physician asked the patient to rank these pains on a 
scale of zero to ten, the pain associated with the broken bone might be ranked at eight or 
nine while the sore throat might be given a rank of three or four. 

Consider next the idea of percentiles. We have already used this concept in numerous 
places in earlier discussions; here we elaborate. Assume that the IQ of a large segment of the 
population is distributed as N(100, 100), that is, a mean of 100 and a standard deviation of 
10. Obviously the Normal approximation is valid only over a limited range because no one 
has an IQ of 1000 or an IQ of —10. The IQ test itself is valid only over a limited range and 
may not give an accurate score for people that are extremely bright or severely cognitively 
handicapped. It is sometimes said that people in either group are “off the IQ scale.” Still 
the IQ test is widely used as an indicator of problem-solving ability. Suppose that the 
result of an IQ test says that the child ranks in the 93rd percentile of the examinees and 
therefore qualifies for admission to selective schools. How do we locate the 93rd percentile 
in a population of n students? 

Definition (percentile): Given an RV $X$ with CDF $F_X(x)$, the $u$-percentile of $X$ is
the number $x_u$ such that $F_X(x_u) = u$. If the function $F_X$ is everywhere continuous with
continuous derivative, then $x_u = F_X^{-1}(u)$, where $F_X^{-1}$ is the inverse function associated
with $F_X$, that is, $F_X^{-1}(F_X(x_u)) = x_u$. A CDF and its inverse function are shown in Figure
6.8-1. In keeping with common usage, we use $x_u$ or $100 \times x_u$ interchangeably to mean
$x_u$-percentile.


Figure 6.8-1 (a) $u$ versus $x_u$; (b) The inverse function $x_u = F_X^{-1}(u)$ versus $u$.

Observation. In the special case where $X: N(0,1)$ with CDF $F_{SN}(z)$, we use the symbol
$z_u$ (or $100 \times z_u$) to denote the $u$-percentile of $X$. If $X: N(\mu, \sigma^2)$ then the $u$-percentile of $X$,
$x_u$, is related to $z_u$ according to

$$x_u = \mu + z_u\sigma. \tag{6.8-1}$$


Example 6.8-1 
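(relation between $x_u$ and $z_u$) We wish to show that $x_u = \mu + z_u\sigma$ if $X: N(\mu, \sigma^2)$. We proceed
as follows:

We write

$$u = F_X(x_u) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{x_u}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(x_u - \mu)/\sigma}\exp\left(-\tfrac{1}{2}z^2\right)dz.$$

The last line is $F_{SN}(z_u)$, the CDF of the standard Normal RV, with $z_u = (x_u - \mu)/\sigma$. Hence $x_u = \mu + z_u\sigma$.
We can use this result in the previously mentioned IQ problem. From the data we have
$F_X(x_u) = 0.93 = F_{SN}(z_u)$. We can find $z_u$ from tables of the Normal CDF, or from
tables of the error function ($\operatorname{erf}(z_u) = F_{SN}(z_u) - 0.5$); we get that $z_u \approx 1.48$. Then with
$x_u = \mu + z_u\sigma = 100 + 1.48(10)$, we get that the 93rd percentile in the IQ is approximately 115.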
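
The table lookup in this example can be replaced by a call to the standard library. The sketch below uses `statistics.NormalDist` to obtain $z_u$ and then applies Equation 6.8-1; the function name `normal_percentile` is an illustrative choice.

```python
from statistics import NormalDist

def normal_percentile(u, mu, sigma):
    """u-percentile of N(mu, sigma^2) via x_u = mu + z_u * sigma (Equation 6.8-1)."""
    z_u = NormalDist().inv_cdf(u)      # u-percentile of the standard Normal
    return mu + z_u * sigma

x_93 = normal_percentile(0.93, 100.0, 10.0)
print(round(x_93, 1))
```

The same value is returned by `NormalDist(100, 10).inv_cdf(0.93)`, which applies the shift and scale internally.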


The Median of a Population Versus Its Mean 


The median of the population $X$ is the point $x_{0.5}$ such that $F_X(x_{0.5}) = 0.5$.† This is to be
contrasted with the mean of $X$, written as $\mu_X$, and defined as $\mu_X \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx$. The
median and mean do not necessarily coincide. For example, in the case of the exponential law
where $f_X(x) = \lambda e^{-\lambda x}u(x)$, we find that $\mu_X = 1/\lambda$ but $x_{0.5} = 0.69/\lambda$. To compute the mean
of $X$ we need $f_X(x)$, which is often not known. The mean may seem like a rather abstract
parameter while the median is merely the point $x_{0.5}$ where $P[X \le x_{0.5}] = 0.5$. However, given $n$
i.i.d. observations $X_1, X_2, \ldots, X_n$ on $X$, we estimate $\mu_X$ with the mean estimator function
(MEF) $\hat{\mu}_X = n^{-1}\sum_{i=1}^{n} X_i$, which happens to be an unbiased and consistent estimator for
the mean of many populations. Indeed it is the simple form of the MEF $\hat{\mu}_X$, and the fact
that $\hat{\mu}_X \to \mu_X$ for large $n$ when $\sigma_X^2$ is finite (see the law of large numbers), that make the
mean so useful in many applications. Realizations of the MEF are intuitively appealing as
they give us a sense of the center of gravity of the data.


†When the event $\{X = x_{0.5}\}$ has zero probability, the events $\{X < x_{0.5}\}$ and $\{X > x_{0.5}\}$ are equally
probable at 0.5. This gives rise to the often-heard statement that the median “is the point at which half the
population is below and half above.” But as the median is the 50th percentile, it includes the probability
of the event $\{X = x_{0.5}\}$ and the statement should be modified to “the median is the point at which half
the population is at or below.” The median is a parameter that characterizes the whole population. The
median of a random sample is only an estimate of the true median.

Example 6.8-2
(median salary versus mean salary) Consider a country where half the workers make $10,000 
per year or less and half make more. Then we can take $10,000 as the median annual income. 
Now suppose that among those making $10,000 or less per annum, the numerical-mean 
annual income is $8000 while for those making more than $10,000 per annum, the numerical- 
mean annual income is $100,000. The numerical mean income for the country as a whole in 
this case is $54,000. In your judgment, which of these figures describes the economy of the 
country better? Which of these figures would you use to put the country in a good (bad) 
light? 


Example 6.8-3 
(median and mean are not the same for the binomial) We make the somewhat trivial observation
that in the binomial case the mean and median do not coincide. For example, with
$n = 5$, the mean is 2.5 but the median, such as it is, is 2. However, when $n$ is large, the
median and mean approach each other and the median can be estimated by the mean.
Indeed, stated without proof, the difference between the mean and median is proportional
to $(p(1 - p))^n$, which becomes arbitrarily small for $n \to \infty$.


Parametric versus Nonparametric Statistics 


The situation where we know or assume a functional form for a density, distribution, or
probability mass function and use this information in computing probabilities, estimating
parameters, and making decisions is called parametric statistics. Typically, in the parametric
case, we might assume a form for the population density, for example, the Normal,
and wish to estimate some unknown parameter of the distribution, for example, the mean
$\mu_X$. In Chapter 7 we make extensive use of parametric statistics in hypothesis testing.
Much of parametric statistics is based on the Central Limit Theorem, which states that the
distribution of the (suitably normalized) sum of a large number of i.i.d. observations tends to the Normal CDF.
The estimation of the properties and parameters of a population without any assumptions
on the form or knowledge of the population distribution is known as distribution-free
or nonparametric statistics. Statistics based only on observations without assuming underlying
distributions are sometimes said to be robust in the sense that the theorems and
conclusions drawn from the observations do not change with the form of the underlying
distributions. Whereas the mean and standard deviation are useful in characterizing the
center and dispersion of a population in the parametric case, the median and range play a
comparable role in the nonparametric case. To estimate the median from $X_1, X_2, \ldots, X_n$, we
order them by magnitude as $Y_1 < Y_2 < \cdots < Y_n$, and estimate $x_{0.5}$ with the sample median
estimator

$$\hat{Y}_{0.5} = \begin{cases} Y_{k+1} & \text{if } n \text{ is odd, that is, } n = 2k + 1, \\ \tfrac{1}{2}(Y_k + Y_{k+1}) & \text{if } n \text{ is even, that is, } n = 2k. \end{cases} \tag{6.8-2}$$

The sample median is not an unbiased estimator for $x_{0.5}$ but becomes nearly so when $n$
is large. The dispersion in the nonparametric case is measured from the 50 percent range,
that is, $\Delta x_{0.50} \triangleq x_{0.75} - x_{0.25}$, or the 90 percent range, that is, $\Delta x_{0.90} \triangleq x_{0.95} - x_{0.05}$, or
some other appropriate range. These have to be estimated from the observations.

Figure 6.8-2 Estimated percentile range from ten ordered samples showing linear interpolation between
the samples. To get the estimated percentile, take the index value and multiply by 100/11. Thus, to a
first approximation, the 90th percentile is estimated from $y_{10}$ while the 9th percentile is estimated from
$y_1$. An approximate 50 percent range is covered by $y_8 - y_3$. (Horizontal axis: ordered samples $y_1, \ldots, y_{10}$; vertical axis: index value.)
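
The sample median estimator of Equation 6.8-2 translates directly into code. The Python sketch below is illustrative; the function name `sample_median` is an assumption, and the only subtlety is the shift from the 1-based subscripts of the text to 0-based list indices.

```python
def sample_median(xs):
    """Sample median per Equation 6.8-2, applied to the ordered sample."""
    ys = sorted(xs)                 # Y_1 <= Y_2 <= ... <= Y_n (stored 0-based)
    n = len(ys)
    k = n // 2
    if n % 2 == 1:                  # n = 2k + 1: Y_{k+1}, which is ys[k]
        return ys[k]
    return 0.5 * (ys[k - 1] + ys[k])    # n = 2k: (Y_k + Y_{k+1}) / 2

print(sample_median([7, -2, 7.2, 1, 3]))
print(sample_median([3, -2, -9, 4]))
```

The standard library function `statistics.median` follows the same rule and can be used instead.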


Example 6.8-4 
(interpolation to get percentile points) Using the symbol $\alpha \approx \beta$ to mean $\alpha$ estimates $\beta$, we
have $Y_3 \approx x_{0.273}$, $Y_4 \approx x_{0.364}$ and, using linear interpolation,

$$Y_4 + \frac{(Y_4 - Y_3)(0.3 - 4/11)}{1/11} \approx x_{0.3}.$$


Interpolation between samples is shown in Figure 6.8-2. 
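
The interpolation rule of this example, in which $Y_k$ estimates the $k/(n+1)$ percentile, can be sketched as a small Python function. The function name `interpolated_percentile` and the ten sample values are hypothetical; the function interpolates from the lower neighbor, which gives the same value as the expression above.

```python
def interpolated_percentile(xs, u):
    """Estimate x_u by linear interpolation, with Y_k estimating x_{k/(n+1)}."""
    ys = sorted(xs)
    n = len(ys)
    pos = u * (n + 1)                   # fractional index into Y_1, ..., Y_n
    if pos < 1 or pos > n:
        raise ValueError("u is outside the range covered by the ordered samples")
    k = min(int(pos), n - 1)            # lower neighbor Y_k (1-based)
    frac = pos - k
    return ys[k - 1] + (ys[k] - ys[k - 1]) * frac

ys = [float(i) for i in range(1, 11)]   # ten hypothetical ordered samples 1, ..., 10
print(interpolated_percentile(ys, 0.3))
print(interpolated_percentile(ys, 0.5))
```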


Confidence Interval on the Percentile 


We discuss next a fundamental result connecting order statistics with percentiles. Once
again the model is that of collecting a sample of $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on a
RV $X$ with CDF $F_X(x)$. We recall the notation $P[X_i \le x_u] \triangleq u$. Next we order the samples
by signed magnitude to get $Y_1 < Y_2 < \cdots < Y_n$. To remind the reader: if a set of realizations
of the $X_i$, $i = 1, \ldots, 5$, are $x_1 = 7$, $x_2 = -2$, $x_3 = 7.2$, $x_4 = 1$, $x_5 = 3$ then the associated
realizations on the $Y_i$, $i = 1, \ldots, 5$, are $y_1 = -2$, $y_2 = 1$, $y_3 = 3$, $y_4 = 7$, $y_5 = 7.2$. From the
subscripts on $\{Y_i\}$ we can make an obvious but remarkable statement on the $\{X_i\}$, namely
that the event $\{Y_k < x_u\}$ implies that there are at least $k$ of the $\{X_i\}$ that are less than
$x_u$; there may be more but certainly not fewer. Then, because the $\{X_i\}$ are i.i.d., we can use
the binomial probability formula to compute $P[Y_k < x_u]$ as

$$P[Y_k < x_u] = P[\text{at least } k \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=k}^{n}\binom{n}{i}u^i(1 - u)^{n-i}. \tag{6.8-3}$$

Next consider the event $\{Y_{k+r} > x_u\}$. Since $Y_{k+r}$ is the $(k+r)$th element in the ordering of
the $\{X_i\}$, there are at least $n - (k+r) + 1$ of the $\{X_i\}$ that are greater than $x_u$. Equivalently
there can be no more than $k + r - 1$ of the $\{X_i\}$ less than $x_u$. Then

$$P[Y_{k+r} > x_u] = P[\text{no more than } k + r - 1 \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=0}^{k+r-1}\binom{n}{i}u^i(1 - u)^{n-i}. \tag{6.8-4}$$

The intersection of the events $\{Y_{k+r} > x_u\}$ and $\{Y_k < x_u\}$ is the event $\{Y_k < x_u < Y_{k+r}\}$.
Its probability is

$$P[Y_k < x_u < Y_{k+r}] = \sum_{i=k}^{k+r-1}\binom{n}{i}u^i(1 - u)^{n-i} \tag{6.8-5}$$

and is independent of $f_X(x)$. The result given in Equation 6.8-5 is one of the major results
of nonparametric statistics and has important applications as we illustrate below.
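
Equation 6.8-5 is a finite binomial sum and is straightforward to evaluate. The Python sketch below is illustrative; the function name `coverage_probability` is an assumption, and `math.comb` supplies the binomial coefficients.

```python
from math import comb

def coverage_probability(n, k, r, u):
    """P[Y_k < x_u < Y_{k+r}] per Equation 6.8-5 (sum over i = k, ..., k+r-1)."""
    return sum(comb(n, i) * u**i * (1 - u) ** (n - i) for i in range(k, k + r))

# Probability that [Y_1, Y_n] covers the median (u = 0.5) for n = 5 and n = 6.
print(coverage_probability(5, 1, 4, 0.5))
print(coverage_probability(6, 1, 5, 0.5))
```

Note that the distribution of $X$ enters only through $u$, which is why the same function serves any continuous population.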


Example 6.8-5 
(sample size needed to cover the median at 95 percent confidence) We seek the end points
$Y_1, Y_n$ of a random interval $[Y_1, Y_n]$ so that the event $\{Y_1 < x_{0.5} < Y_n\}$ occurs with probability
0.95. Here $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$. In effect, how large
should $n$ be?

The answer is furnished by computing

$$P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1}\binom{n}{i}\left(\frac{1}{2}\right)^n = 0.95.$$

We find that for $n = 5$, $P[Y_1 < x_{0.5} < Y_5] \approx 0.94$, just short of the target, while $n = 6$ gives approximately 0.97.
The probability that the random interval
$[Y_1, Y_n]$ covers the 50th percentile point is shown in Figure 6.8-3 for various values of $n$.


Figure 6.8-3 Probability that the event $\{Y_1 < x_{0.5} < Y_n\}$ covers the median for various values of $n$. (Plot titled “Probability that random interval covers the median”; horizontal axis: sample size; vertical axis: probability that median is covered.)

Figure 6.8-4 Among the pairwise intervals $[Y_k, Y_{k+1}]$, the interval $[Y_3, Y_4]$ is most likely to cover
$x_{0.33}$. (Horizontal axis: $k$th ordered pair, $k = 1, \ldots, 9$; vertical axis: probability that the 33rd percentile point is covered by the $k$th adjacent ordered pair.)


Example 6.8-6 
(between which pair of ordered samples does $x_{0.33}$ lie?) We have a set of ordered samples
$\{Y_1, Y_2, \ldots, Y_n\}$ and wish to find the pair $\{Y_i, Y_{i+1}, i = 1, \ldots, n - 1\}$ that maximizes the
probability of covering the 33.33rd percentile point. The 33.33rd percentile point is defined
by $u = 1/3 = F_X(x_{0.33})$. For specificity we assume $n = 10$. From Equation 6.8-5 we compute

$$P[Y_k < x_{0.33} < Y_{k+1}] = \frac{10!}{k!\,(10 - k)!}(1/3)^k(2/3)^{10-k}, \quad k = 1, \ldots, 9$$

and plot the result in Figure 6.8-4. Clearly the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
The probability of the event $\{Y_3 < x_{0.33} < Y_4\}$ is 0.26.
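
The nine probabilities in this example can be tabulated with a few lines of Python. This is an illustrative sketch; the function name `adjacent_pair_probabilities` is an assumption.

```python
from math import comb

def adjacent_pair_probabilities(n, u):
    """P[Y_k < x_u < Y_{k+1}] for k = 1, ..., n-1 (Equation 6.8-5 with r = 1)."""
    return {k: comb(n, k) * u**k * (1 - u) ** (n - k) for k in range(1, n)}

probs = adjacent_pair_probabilities(10, 1 / 3)
best_k = max(probs, key=probs.get)      # index k of the most likely covering pair

for k, prob in probs.items():
    print(k, round(prob, 3))
print("most likely pair starts at k =", best_k)
```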


Confidence Interval for the Median When n Is Large 


If $n$ is large enough so that the Normal approximation to the binomial is valid in distribution,
we can use

$$P[\alpha \le S_n \le \beta] \approx \frac{1}{\sqrt{2\pi}}\int_{\alpha_n}^{\beta_n}\exp\left(-\tfrac{1}{2}x^2\right)dx \tag{6.8-6a}$$

where

$$P[\alpha \le S_n \le \beta] = \sum_{i=\alpha}^{\beta}\binom{n}{i}p^i(1 - p)^{n-i},$$

$$\alpha_n \triangleq \frac{\alpha - np - 0.5}{\sqrt{np(1 - p)}}, \qquad \beta_n \triangleq \frac{\beta - np + 0.5}{\sqrt{np(1 - p)}}. \tag{6.8-6b}$$

To apply these results to the problem at hand, we write

$$P[Y_r < x_{0.5} < Y_{n-r+1}] = \sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n, \tag{6.8-7}$$

where we used that, by definition of the median, $u = F_X(x_{0.5}) = 1/2$. The choice of
subscripts will ensure that the confidence interval will begin at the $r$th place counting from
the bottom, that is, from one, and end at the place reached by counting $r$ observations back
from the top. For example if the 95 percent confidence calculation for $n = 10$ yields $r = 3$,
the confidence interval begins at the third observation and ends at the eighth observation,
both points reached by counting three places from bottom and top, respectively, that is,
1, 2, 3 ($Y_3$) and 10, 9, 8 ($Y_8$), and the result would appear as $P[Y_3 < x_{0.5} < Y_8] = 0.95$.

In the binomial sum in Equation 6.8-7 we note that its mean is $n/2$ and its standard
deviation is $\sqrt{n}/2$. Hence the Normal approximation to the binomial sum in Equation 6.8-7
for a 95 percent confidence interval is

$$\sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n \approx \frac{1}{\sqrt{2\pi}}\int_{\alpha_n}^{\beta_n}\exp\left[-\tfrac{1}{2}x^2\right]dx = 0.95,$$

which, from the tables of the standard Normal distribution (or the error function), yields
$\alpha_n = -1.96$, $\beta_n = 1.96$. Then it follows from Equation 6.8-6b that

$$1.96 = \frac{n - r - n/2 + 0.5}{\sqrt{n}/2},$$
$$-1.96 = \frac{r - n/2 - 0.5}{\sqrt{n}/2},$$

which yields $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$. If $r$ is not an integer replace $r$ by $\lfloor r \rfloor$, where the
latter is the largest integer less than or equal to $r$.


Example 6.8-7 
(95 percent confidence interval for the median for $n = 20$) We make 20 observations on an
RV $X$ and label these $\{X_i, i = 1, \ldots, 20\}$. We order them by signed magnitude so that
$Y_1 < Y_2 < \cdots < Y_{20}$. We use $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$ to obtain $r = 6.12$ and $\lfloor r \rfloor = 6$.
Then $P[Y_6 < x_{0.5} < Y_{15}] \ge 0.95$.
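
The index $r$ and the exact coverage of the resulting interval can both be computed directly. In the Python sketch below, the function names are illustrative assumptions; `median_interval_index` applies the large-sample formula for $r$ and `exact_coverage` evaluates the binomial sum of Equation 6.8-7 without the Normal approximation.

```python
from math import comb, floor, sqrt
from statistics import NormalDist

def median_interval_index(n, delta=0.95):
    """r = floor(n/2 - z * sqrt(n)/2 + 0.5), with z the (1+delta)/2 Normal percentile."""
    z = NormalDist().inv_cdf((1 + delta) / 2)
    return floor(n / 2 - z * sqrt(n) / 2 + 0.5)

def exact_coverage(n, r):
    """Exact P[Y_r < x_0.5 < Y_{n-r+1}] from the binomial sum in Equation 6.8-7."""
    return sum(comb(n, i) for i in range(r, n - r + 1)) / 2**n

r = median_interval_index(20)
print(r, round(exact_coverage(20, r), 4))
```

For $n = 20$ the exact coverage of $[Y_6, Y_{15}]$ comes out slightly above the nominal 0.95, consistent with the inequality stated in the example.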


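6.9 ESTIMATION OF VECTOR MEANS AND COVARIANCE MATRICES†

Let $\mathbf{X} \triangleq (X_1, \ldots, X_p)^T$ be a $p$-component random vector with pdf $f_{\mathbf{X}}(\mathbf{x})$. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be
$n$ observations on $\mathbf{X}$, that is, the $\mathbf{X}_i$, $i = 1, \ldots, n$, are drawn from $f_{\mathbf{X}}(\mathbf{x})$. Then $\mathbf{X}_i$, $i = 1, \ldots, n$,
are i.i.d. random vectors with pdf $f_{\mathbf{X}}(\mathbf{x}_i)$. We show below how to estimate

(i) $\boldsymbol{\mu}_X \triangleq E[\mathbf{X}] = (\mu_1, \ldots, \mu_p)^T$,

where $\mu_j \triangleq E[X_j]$, $j = 1, \ldots, p$,

and

(ii) $\mathbf{K}_{XX} \triangleq E[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{X} - \boldsymbol{\mu}_X)^T]$.

†This section and the next one can be omitted on a first reading.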


The vector and matrix parameters $\boldsymbol{\mu}_X$ and $\mathbf{K}_{XX}$ are useful in many signal processing
applications. They also figure prominently in characterizing the multi-dimensional Normal
distribution [6-5]. The covariance matrix $\mathbf{K}_{XX}$ is most often a full-rank, positive-definite,
real-symmetric matrix. The properties of such matrices are well-known [6-6] and can be
exploited in their estimation.


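Estimation of $\boldsymbol{\mu}$


Consider the $p$-vector estimator $\hat{\boldsymbol{\Theta}}$ given by

$$\hat{\boldsymbol{\Theta}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i. \tag{6.9-1}$$

We shall show that $\hat{\boldsymbol{\Theta}}$ is unbiased and consistent for $\boldsymbol{\mu}$. We arrange the observations as in
Table 6.9-1.

In Table 6.9-1 $X_{ij}$ is the $j$th component of the random vector $\mathbf{X}_i$. The components of the
vector $\mathbf{Y}_j$, $j = 1, \ldots, p$, are $n$ i.i.d. observations on the $j$th component of the random vector
$\mathbf{X}$. From the scalar case we already know that

$$\hat{\Theta}_j \triangleq \frac{1}{n}\sum_{i=1}^{n} X_{ij}, \quad j = 1, \ldots, p \tag{6.9-2}$$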


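Table 6.9-1 Observed Data

             X_1    ...    X_i    ...    X_n
    Y_1:     X_11   ...    X_i1   ...    X_n1
     :        :             :             :
    Y_j:     X_1j   ...    X_ij   ...    X_nj      (p rows)
     :        :             :             :
    Y_p:     X_1p   ...    X_ip   ...    X_np

The components of $\mathbf{Y}_j$ are all that
is necessary for estimating the $j$th
component, $\mu_j$, of the vector $\boldsymbol{\mu}$.

is unbiased and consistent for $\mu_j \triangleq E[X_{ij}]$, $i = 1, \ldots, n$. It follows therefore that the vector
estimator $\hat{\boldsymbol{\Theta}} \triangleq (\hat{\Theta}_1, \ldots, \hat{\Theta}_p)^T$ is unbiased and consistent for $\boldsymbol{\mu}$. The vector $\mathbf{Y}_j$ contains all
the information for estimating $\mu_j$. Thus, $E[\mathbf{Y}_j] = \mu_j\mathbf{i}$, where $\mathbf{i} \triangleq (1, 1, \ldots, 1)^T$ is the vector of $n$ ones.

When $\mathbf{X}$ is normal, $\hat{\boldsymbol{\Theta}}$ is normal. Even when $\mathbf{X}$ is not normal, $\hat{\boldsymbol{\Theta}}$ tends to the normal
for large $n$ by the central limit theorem (Theorem 4.7-1).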


Estimation of the covariance K 
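If the mean $\boldsymbol{\mu}$ is known, then the estimator

$$\hat{\boldsymbol{\Theta}} \triangleq \frac{1}{n}\sum_{i=1}^{n}(\mathbf{X}_i - \boldsymbol{\mu})(\mathbf{X}_i - \boldsymbol{\mu})^T \tag{6.9-3}$$

is unbiased for $\mathbf{K}$. However, since the mean is generally estimated from the sample mean $\hat{\boldsymbol{\mu}}$,
it turns out that the estimator

$$\hat{\boldsymbol{\Theta}} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_i - \hat{\boldsymbol{\mu}})(\mathbf{X}_i - \hat{\boldsymbol{\mu}})^T \tag{6.9-4}$$

is unbiased for $\mathbf{K}_{XX}$. To prove this result requires some effort. First observe that the diagonal
elements of $\hat{\boldsymbol{\Theta}}$ are of the form

$$\hat{\Theta}_{jj} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \hat{\mu}_j)^2, \tag{6.9-5}$$

which we already know from the univariate case are unbiased for $\sigma_j^2$. Next
consider the sequence ($l \ne m$)

$$X_{1l} + X_{1m},\; X_{2l} + X_{2m},\; \ldots,\; X_{nl} + X_{nm}, \tag{6.9-6}$$

which are $n$ i.i.d. observations $Z_{lm}^{(i)}$ on a univariate RV $Z_{lm} \triangleq X_l + X_m$ with mean $\mu_l + \mu_m$
and variance

$$\operatorname{Var}[Z_{lm}] = E[((X_l - \mu_l) + (X_m - \mu_m))^2] = \sigma_l^2 + \sigma_m^2 + 2K_{lm}, \tag{6.9-7}$$

where $K_{lm} \triangleq E[(X_l - \mu_l)(X_m - \mu_m)]$ is the $lm$th element of $\mathbf{K}_{XX}$. Finally, consider

$$\hat{\Theta}_{lm} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}\left[Z_{lm}^{(i)} - (\hat{\mu}_l + \hat{\mu}_m)\right]^2, \tag{6.9-8}$$

which, by Equation 6.8-15, is unbiased for $\sigma_l^2 + \sigma_m^2 + 2K_{lm}$. If we expand Equation 6.9-8
and use the fact that $Z_{lm} = X_l + X_m$, we obtain

$$\hat{\Theta}_{lm} = \frac{1}{n-1}\sum_{i=1}^{n}\left[(X_{il} - \hat{\mu}_l) + (X_{im} - \hat{\mu}_m)\right]^2$$
$$= \frac{1}{n-1}\sum_{i=1}^{n}(X_{il} - \hat{\mu}_l)^2 + \frac{1}{n-1}\sum_{i=1}^{n}(X_{im} - \hat{\mu}_m)^2 + \frac{2}{n-1}\sum_{i=1}^{n}(X_{il} - \hat{\mu}_l)(X_{im} - \hat{\mu}_m). \tag{6.9-9}$$

In Equation 6.9-9, the first term is unbiased for $\sigma_l^2$, the second is unbiased for $\sigma_m^2$, and the
sum of all three is unbiased by Equation 6.9-8 for $\sigma_l^2 + \sigma_m^2 + 2K_{lm}$. We therefore conclude
that

$$\hat{S}_{lm} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}(X_{il} - \hat{\mu}_l)(X_{im} - \hat{\mu}_m) \tag{6.9-10}$$

is unbiased for $K_{lm}$ ($= K_{ml}$). Hence every term of $\hat{\boldsymbol{\Theta}}$ in Equation 6.9-4 is unbiased for every
corresponding term in $\mathbf{K}_{XX}$. In this sense $\hat{\boldsymbol{\Theta}} \triangleq \hat{\mathbf{K}}_{XX}$ is unbiased for $\mathbf{K}_{XX}$.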

By resorting again to the univariate case and assuming that all moments up to the
fourth order exist, we can show consistency for every term in the estimator for $\mathbf{K}_{XX}$, that
is, Equation 6.9-4. Hence, without specifying the distribution, Equations 6.9-1 and 6.9-4 are
unbiased and consistent estimators for $\boldsymbol{\mu}_X$ and $\mathbf{K}_{XX}$, respectively.
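
As a minimal sketch of the estimators discussed above, the following Python fragment computes the sample mean vector and the covariance estimate with the $1/(n-1)$ factor of Equation 6.9-4. The function names and the three toy observations are illustrative assumptions, not part of the text.

```python
def sample_mean(rows):
    """Component-wise sample mean of n observed p-vectors."""
    n = len(rows)
    p = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def sample_covariance(rows):
    """Covariance estimate with the 1/(n-1) factor (form of Equation 6.9-4)."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    mu = sample_mean(rows)
    p = len(mu)
    acc = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mu[j] for j in range(p)]  # deviation from the sample mean
        for l in range(p):
            for m in range(p):
                acc[l][m] += d[l] * d[m]
    return [[v / (n - 1) for v in row] for row in acc]


# Hypothetical toy data: three observations of a two-component vector.
observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mu_hat = sample_mean(observations)
k_hat = sample_covariance(observations)
print(mu_hat)  # [2.0, 4.0]
print(k_hat)   # [[1.0, 2.0], [2.0, 4.0]]
```

The off-diagonal entries are the cross-product sums of Equation 6.9-10, so the returned matrix is symmetric by construction.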

When $\mathbf{X}$ is normal, $\hat{\mathbf{K}}_{XX}$ obeys a structurally complex probability law called the
Wishart distribution (see [6-6, p. 126]). More generally, when the form of the pdf of $\mathbf{X}$
is known, one can use the maximum likelihood method of estimating such parameters
as $\sigma_X^2$, $\boldsymbol{\mu}_X$, and $\mathbf{K}_{XX}$. Maximum likelihood estimators have several, but not all, desirable
properties as estimators. The next example shows that the MLE for the mean is not a
minimum mean-square error (MMSE) estimator.


Example 6.9-1 
([6-5], p. 21.) Consider the sample mean estimator from Equation 6.8-3, that is,

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

We recognize that this estimator is the MLE for the mean $\mu$. Now we ask: what constant
$a$ in the scalar estimator $\hat{\Theta} \triangleq a\hat{\mu}$ will generate the MMSE estimator of $\mu$? Recall that the
$X_i$, $i = 1, \ldots, n$, are i.i.d. RVs with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2$.


Solution We are seeking the value of $a$ such that

$$E[(a\hat{\mu} - \mu)^2]$$  (6.9-11)

is a minimum. Clearly $\hat{\mu}$ is unbiased for $\mu$, and it seems hard to believe that there may
exist a $\hat{\Theta}$ with $a \neq 1$ that, though yielding a biased estimator, gives a lower MSE than
$\hat{\Theta} = \hat{\mu}$.

For any estimator $\hat{\Theta}$, the mean-square error in estimating $\mu$ is

$$E[(\hat{\Theta} - \mu)^2] = E[\{(\hat{\Theta} - E[\hat{\Theta}]) + (E[\hat{\Theta}] - \mu)\}^2] = \mathrm{Var}[\hat{\Theta}] + (E[\hat{\Theta}] - \mu)^2.$$  (6.9-12)

If $\hat{\Theta}$ is unbiased then the last term, which is the square of the bias (Definition 6.8-2), is
zero. For the case at hand, $\hat{\Theta} = a\hat{\mu}$; thus

$$E[(\hat{\Theta} - \mu)^2] = a^2\,\mathrm{Var}[\hat{\mu}] + (a\mu - \mu)^2 = \frac{a^2\sigma^2}{n} + (a - 1)^2\mu^2.$$  (6.9-13)

To find the MMSE estimator, we differentiate Equation 6.9-13 with respect to $a$ and set the
result to zero. This yields the optimum value of $a = a_0$, that is,

$$a_0 = \frac{\mu^2}{(\sigma^2/n) + \mu^2}.$$
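
The minimization in Equation 6.9-13 can be checked numerically. The sketch below evaluates the mean-square error of the scaled sample mean as a function of $a$ and compares the unbiased choice $a = 1$ with the stationary point obtained by setting the derivative to zero. The parameter values ($\mu = 1$, $\sigma^2 = 4$, $n = 10$) and the function names are arbitrary assumptions chosen for illustration.

```python
def mse_scaled_mean(a, mu, sigma2, n):
    """MSE of a * (sample mean): a^2 sigma^2 / n + (a - 1)^2 mu^2 (Equation 6.9-13)."""
    return a * a * sigma2 / n + (a - 1.0) ** 2 * mu * mu


def optimal_scale(mu, sigma2, n):
    """Stationary point of the MSE with respect to a."""
    return mu * mu / (mu * mu + sigma2 / n)


# Illustrative values only; they are not taken from the text.
mu, sigma2, n = 1.0, 4.0, 10
a0 = optimal_scale(mu, sigma2, n)
mse_unbiased = mse_scaled_mean(1.0, mu, sigma2, n)
mse_best = mse_scaled_mean(a0, mu, sigma2, n)
print(a0, mse_unbiased, mse_best)
```

Note that the optimum scale depends on the unknown $\mu$ itself, so the calculation shows only that a biased estimator can have a lower MSE than the sample mean; it does not by itself supply a usable estimator.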


6.10 LINEAR ESTIMATION OF VECTOR PARAMETERS†


Many measurement problems in the real world are described by the following model:

$$y(t) = \int_T h(t, \tau)\theta(\tau)\,d\tau + n(t),$$  (6.10-1)

where $y(t)$ is the observation or measurement, $T$ is the integration set, $\theta(\tau)$ is the unknown
parameter function, $h(t, \tau)$ is a function that is characteristic of the system and links the
parameter function to the measurement but is itself independent of $\theta(\tau)$, and $n(t)$ is the
inevitable error in the measurement due to noise. For computational purposes Equation
6.10-1 must be reduced to its discrete form

$$\mathbf{Y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{N},$$  (6.10-2)

where $\mathbf{Y}$ is an $n \times 1$ vector of observations with components $Y_i$, $i = 1, \ldots, n$, $\mathbf{H}$ is a
known $n \times k$ matrix ($n > k$), $\boldsymbol{\theta}$ is an unknown $k \times 1$ parameter vector, and $\mathbf{N}$ is an
$n \times 1$ random vector whose unknown components $N_i$, $i = 1, \ldots, n$, are the errors or noise
associated with the $i$th observation $Y_i$. We shall assume without loss of generality that
$E[\mathbf{N}] = \mathbf{0}$.‡


†This section can be omitted on a first reading.
‡The symbol $\mathbf{0}$ here stands for the zero vector, that is, the vector whose components are all zero.

Equation 6.10-2 is known as the linear model. We now ask the following question: How
do we extract a “good” estimate of $\boldsymbol{\theta}$ from the observed values of $\mathbf{Y}$ if we restrict our
estimator $\hat{\boldsymbol{\Theta}}$ to be a linear function of $\mathbf{Y}$? By a linear function we mean

$$\hat{\boldsymbol{\Theta}} = \mathbf{B}\mathbf{Y},$$  (6.10-3)


where B, which does not depend on Y, is to be determined. The problem posed here is of 
practical significance. It is one of the most fundamental problems in parameter estimation 
theory and covered in great detail in numerous books, for example, Kendall and Stuart [6-8] 
and Lewis and Odell [6-9]. It also is an immediate application of the probability theory of 
random vectors and is useful for understanding various topics in subsequent chapters. 

Before computing the matrix B in Equation 6.10-3, we must first furnish some results 
from matrix calculus. 


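Derivative of a scalar function of a vector. Let $q(\mathbf{x})$ be a scalar function of the vector
$\mathbf{x} = (x_1, \ldots, x_n)^T$. Then

$$\frac{dq(\mathbf{x})}{d\mathbf{x}} \triangleq \left(\frac{\partial q}{\partial x_1}, \ldots, \frac{\partial q}{\partial x_n}\right)^T.$$  (6.10-4)

Thus, the derivative of $q(\mathbf{x})$ with respect to $\mathbf{x}$ is a column vector whose $i$th component is
the partial derivative of $q(\mathbf{x})$ with respect to $x_i$.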


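Derivative of quadratic forms. Let $\mathbf{A}$ be a real-symmetric $n \times n$ matrix and let $\mathbf{x}$ be
an arbitrary $n$-vector. Then the derivative of the quadratic form

$$q(\mathbf{x}) \triangleq \mathbf{x}^T\mathbf{A}\mathbf{x}$$

with respect to $\mathbf{x}$ is

$$\frac{dq(\mathbf{x})}{d\mathbf{x}} = 2\mathbf{A}\mathbf{x}.$$  (6.10-5)

The proof of Equation 6.10-5 is obtained by writing

$$q(\mathbf{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i a_{ij} x_j = \sum_{i=1}^{n} x_i^2 a_{ii} + \sum_{i=1}^{n}\sum_{j \neq i} a_{ij} x_i x_j.$$

Hence

$$\frac{\partial q(\mathbf{x})}{\partial x_k} = 2x_k a_{kk} + 2\sum_{i \neq k} a_{ki} x_i = 2\sum_{i=1}^{n} a_{ki} x_i$$

or

$$\frac{dq(\mathbf{x})}{d\mathbf{x}} = 2\mathbf{A}\mathbf{x}.$$  (6.10-6)

Derivative of scalar products. Let $\mathbf{a}$ and $\mathbf{x}$ be two $n$-vectors. Then with $y = \mathbf{a}^T\mathbf{x}$, we
obtain

$$\frac{dy}{d\mathbf{x}} = \mathbf{a}.$$  (6.10-7)

Let $\mathbf{x}$ and $\mathbf{y}$ be two $n$-vectors and $\mathbf{A}$ an $n \times n$ matrix. Then with $q = \mathbf{y}^T\mathbf{A}\mathbf{x}$,

$$\frac{\partial q}{\partial \mathbf{x}} = \mathbf{A}^T\mathbf{y}.$$  (6.10-8)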
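
A quick numerical sanity check of the quadratic-form rule $dq/d\mathbf{x} = 2\mathbf{A}\mathbf{x}$ can be made by comparing it against central finite differences. The sketch below does this for a hypothetical symmetric $2 \times 2$ matrix; the matrix, the test point, and the helper names are assumptions made for illustration.

```python
def quad_form(A, x):
    """q(x) = x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def grad_analytic(A, x):
    """Gradient 2 A x, valid when A is symmetric (Equation 6.10-5)."""
    n = len(x)
    return [2.0 * sum(A[k][i] * x[i] for i in range(n)) for k in range(n)]


def grad_numeric(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    grad = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad


# Hypothetical symmetric matrix and test point.
A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, -2.0]
g_exact = grad_analytic(A, x)
g_approx = grad_numeric(lambda v: quad_form(A, v), x)
print(g_exact)   # [0.0, -10.0]
print(g_approx)  # agrees with g_exact up to rounding error
```

If $\mathbf{A}$ were not symmetric, the analytic gradient would instead be $(\mathbf{A} + \mathbf{A}^T)\mathbf{x}$, which is why symmetry is assumed in the statement of the rule.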
We return now to Equation 6.10-2:

$$\mathbf{Y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{N}$$

and assume that (recall $E[\mathbf{N}] = \mathbf{0}$)

$$\mathbf{K} \triangleq E[\mathbf{N}\mathbf{N}^T] = \sigma^2\mathbf{I},$$  (6.10-9)

where $\mathbf{I}$ is the identity matrix. Equation 6.10-9 is equivalent to stating that the measurement
errors $N_i$, $i = 1, \ldots, n$, are uncorrelated and that their variances are the same and equal
to $\sigma^2$. This situation is sometimes called white noise.

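A reasonable choice for estimating $\boldsymbol{\theta}$ is to find a $\hat{\boldsymbol{\Theta}}$ that minimizes the sum of squares $S$
defined by

$$S \triangleq (\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}})^T(\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}) \triangleq \|\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}\|^2.$$  (6.10-10)

Note that by finding the $\hat{\boldsymbol{\Theta}}$ that best fits the measurement $\mathbf{Y}$ in the sense of minimizing
$\|\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}\|^2$, we are realizing what is commonly called a least-squares fit to the data.
For this reason, finding the $\hat{\boldsymbol{\Theta}}$ that minimizes $S$ in Equation 6.10-10 is called the least-squares
(LS) method. It is a form of the MMSE estimator. To find the minimum of $S$ with respect
to $\hat{\boldsymbol{\Theta}}$, write

$$S = \mathbf{Y}^T\mathbf{Y} + \hat{\boldsymbol{\Theta}}^T\mathbf{H}^T\mathbf{H}\hat{\boldsymbol{\Theta}} - \hat{\boldsymbol{\Theta}}^T\mathbf{H}^T\mathbf{Y} - \mathbf{Y}^T\mathbf{H}\hat{\boldsymbol{\Theta}}$$

and compute (use Equation 6.10-4 on the LHS and Equations 6.10-5 and 6.10-8 on the
RHS)

$$\frac{\partial S}{\partial \hat{\boldsymbol{\Theta}}} = \mathbf{0} = 2[\mathbf{H}^T\mathbf{H}]\hat{\boldsymbol{\Theta}} - 2\mathbf{H}^T\mathbf{Y},$$

whence (assuming $\mathbf{H}^T\mathbf{H}$ has an inverse)

$$\hat{\boldsymbol{\Theta}}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{Y}.$$  (6.10-11)

Comparing our result with Equation 6.10-3 we see that the $\mathbf{B}$ in Equation 6.10-3 that
furnishes the least-squares solution is given by $\mathbf{B}_0 \triangleq (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$. Equation 6.10-11 is the
LS estimator of $\boldsymbol{\theta}$ based on the measurement $\mathbf{Y}$.

The astute reader will have noticed that we never invoked the fact that $\mathbf{K} = \sigma^2\mathbf{I}$.
Indeed, in arriving at Equation 6.10-11 we essentially treated $\mathbf{Y}$ as deterministic and merely
obtained $\hat{\boldsymbol{\Theta}}_{LS}$ as the generalized inverse (see Lewis and Odell [6-9, p. 6]) of the system of
equations $\mathbf{Y} = \mathbf{H}\boldsymbol{\theta}$. As it stands, the estimator $\hat{\boldsymbol{\Theta}}_{LS}$ given in Equation 6.10-11 has no claim
to being optimum. However, when the covariance of the noise $\mathbf{N}$ is as in Equation 6.10-9,
then $\hat{\boldsymbol{\Theta}}_{LS}$ does indeed have optimal properties in an important sense. We leave it to the
reader to show that $\hat{\boldsymbol{\Theta}}_{LS}$ is unbiased and is a minimum variance estimator.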


Example 6.10-1 
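We are given the following data:

$$6.2 = 3\theta + n_1,$$
$$7.8 = 4\theta + n_2,$$
$$2.2 = \theta + n_3.$$

Find the LS estimate of $\theta$.
Solution The data can be put in the form

$$\mathbf{y} = \mathbf{H}\theta + \mathbf{n},$$

where $\mathbf{y} = (6.2, 7.8, 2.2)^T$ is a realization of $\mathbf{Y}$, $\mathbf{H}$ is a column vector described by $(3, 4, 1)^T$,
and $\mathbf{n} = (n_1, n_2, n_3)^T$ is a realization of $\mathbf{N}$. Hence $\mathbf{H}^T\mathbf{H} = \sum_{i=1}^{3} H_i^2 = 26$ and $\mathbf{H}^T\mathbf{y} =
\sum_{i=1}^{3} H_i y_i = 52$. Thus,

$$\hat{\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y} = \frac{52}{26} = 2.$$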
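
The least-squares formula of Equation 6.10-11 can be sketched in a few lines of Python by forming the normal equations $\mathbf{H}^T\mathbf{H}\,\hat{\boldsymbol{\Theta}} = \mathbf{H}^T\mathbf{Y}$ and solving them by Gauss-Jordan elimination. The function name `least_squares` and the second (straight-line) data set are assumptions for illustration; the first data set is the one in Example 6.10-1.

```python
def least_squares(H, y):
    """Solve (H^T H) theta = H^T y; H is a list of n rows, each of length k."""
    n, k = len(H), len(H[0])
    gram = [[sum(H[i][r] * H[i][c] for i in range(n)) for c in range(k)]
            for r in range(k)]                       # H^T H
    rhs = [sum(H[i][r] * y[i] for i in range(n)) for r in range(k)]  # H^T y
    M = [gram[r] + [rhs[r]] for r in range(k)]       # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("H^T H is singular; no unique LS solution")
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


# Data of Example 6.10-1: one unknown parameter, three measurements.
theta_hat = least_squares([[3.0], [4.0], [1.0]], [6.2, 7.8, 2.2])
print(theta_hat)  # approximately [2.0]

# Hypothetical two-parameter case: noise-free points on the line y = 1 + 2x.
line_fit = least_squares([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
print(line_fit)   # approximately [1.0, 2.0]
```

Solving the normal equations directly is adequate for small, well-conditioned problems such as these; for larger or ill-conditioned $\mathbf{H}$ a numerically more careful factorization would normally be preferred.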


Example 6.10-2 
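([6-8, p. 77].) Let $\boldsymbol{\theta} = (\theta_1, \theta_2)^T$ be a two-component parameter vector to be estimated, and
let $\mathbf{H}$ be an $n \times 2$ matrix of coefficients partitioned into column vectors as $\mathbf{H} = (\mathbf{H}_1\ \mathbf{H}_2)$,
where $\mathbf{H}_i$, $i = 1, 2$, is an $n$-vector. Then with the $n$-vector $\mathbf{Y}$ representing the observation
data, the linear model assumes the form

$$\mathbf{Y} = (\mathbf{H}_1\ \mathbf{H}_2)\boldsymbol{\theta} + \mathbf{N}$$

and the LS estimator of $\boldsymbol{\theta}$ is

$$\hat{\boldsymbol{\Theta}}_{LS} = \begin{pmatrix} \mathbf{H}_1^T\mathbf{H}_1 & \mathbf{H}_1^T\mathbf{H}_2 \\ \mathbf{H}_2^T\mathbf{H}_1 & \mathbf{H}_2^T\mathbf{H}_2 \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{H}_1^T\mathbf{Y} \\ \mathbf{H}_2^T\mathbf{Y} \end{pmatrix}.$$


SUMMARY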


In the branch of statistics known as parameter estimation we apply the tools of probability 
to observational data to estimate parameters associated with probability functions. We 
began the chapter by stressing the importance of independent, identically distributed (i.i.d.) 
observations on a random variable of interest. We then described how these observations 
can be organized to estimate parameters such as the mean and variance, with emphasis 
on the Normal distribution. The problem of making “hard” (i.e., categorical) statements 
about parameters when the number of observations is finite was resolved using the notion 
of confidence intervals. Thus, we were able to say that based on the observations, the 
true mean, or variance, or both had to lie in a computed interval with a near 100 percent 
confidence. We studied the properties of the standard mean-estimating function and found 
that it was unbiased and consistent. 

We found that the t-distribution, describing the probabilistic behavior of the T random 
variable, was of central importance in constructing a confidence interval for the mean of a 
Normal random variable when the variance is unknown. 

In estimating the variance of a Normal random variable, we found that the Chi-square 
distribution was useful in constructing a near 100 percent confidence interval for the vari- 
ance. We briefly discussed a method of estimating the standard deviation of a Normal 
random variable from ordered observations. 

We demonstrated that confidence intervals could also be developed for parameters of 
distributions other than the Normal. This was demonstrated with examples from the expo- 
nential and Bernoulli distributions. 

A method of estimating parameters based on the idea of which parameter was most 
likely to have produced the observational data was discussed. This method, called maximum 
likelihood estimation (MLE), is very powerful but does not always yield unbiased or minimum 
mean-square error estimators. 

Toward the end of the chapter we introduced nonparametric methods for parameter 
estimation. These methods, also called distribution-free estimation, do not assume a specific 
distribution for generating the observational data. In this sense they are said to be robust. 
We found that a number of important results in the nonparametric case could be obtained 
using ordered data and the binomial distribution. 

Finally, we extended the earlier discussions on parameter estimation to the vector case. 
In particular, we showed how the elements of vector means and covariance matrices could 
be estimated from observational data. A brief discussion of estimating vector parameters 
from linear operation on measurement data completed the chapter. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


6.1 We make $n$ (even) i.i.d. observations on a Normal random variable $X$. We label them
$\{X_i, i = 1, \ldots, n\}$. Now we rearrange the sequence as follows: $Y_{2i} = X_{2i-1}$, $Y_{2i-1} =
X_{2i}$, $i = 1, \ldots, n/2$. Are the random variables in the $\{Y_i\}$ sequence independent?

6.2 We have three i.i.d. observations on $X : N(0, 1)$. Call these $X_i$, $i = 1, 2, 3$. Compute
$f_{X_1X_2X_3}(x_1, x_2, x_3)$ and compare with $f_{X_1+X_2+X_3}(y)$.

6.3 In a village in a developing country, 361 villagers are exposed to the Ebola
hemorrhagic-fever virus. Of the 361 exposed villagers, 189 die of the virus infection.
Compute a 95 percent confidence interval on the probability of dying from the Ebola
virus once you have been exposed to it. What is the margin of error?

6.4 Show that the roots of the polynomial $(p - \hat{p})^2 - (9/n)p(1 - p) = 0$ that appeared
in Example 6.1-6 are indeed as given in Equation 6.1-1.

6.5 Referring again to Example 6.1-6, compute $|p_1 - p_2|$ as $\hat{p}$ varies from zero to one.
Do this for different values of $n$, for example, $n = 10, 20, 30, 50$.

6.6 Describe how you would test for the fairness of a coin with a 95 percent confidence
interval on the probability that the coin will come up heads.

6.7 Consider the variance estimating functions in Equations 6.3-3 and 6.3-4. Show that
for values of $n > 20$, the difference between them becomes extremely small. Reproduce
the curve shown below.

[Figure: Difference between variance estimating functions versus sample size n.
Horizontal axis: sample size (0 to 40); vertical axis: difference (0 to 0.6).]

6.8 Compute $P[|\hat{\mu}_X(n) - \mu_X| < 0.1]$ as a function of $n$ when $X : N(1, 1)$.

6.9 Plot the width of a 95 percent confidence interval on the mean of a Normal random
variable whose variance is unity versus the number of samples $n$.

6.10 Show that the MGF of the gamma pdf

$$f(x; \alpha, \beta) = (\alpha!\,\beta^{\alpha+1})^{-1} x^{\alpha} \exp(-x/\beta),\quad x > 0;\ \alpha > -1,\ \beta > 0$$

is $M(t) = (1 - \beta t)^{-(\alpha+1)}$.

6.11 We make $n$ i.i.d. observations $X_i$, $i = 1, \ldots, n$, on $X : N(\mu, \sigma^2)$ and construct
$Y_i = (X_i - \mu)/\sigma$. Use the result of Problem 6.10 to show that the pdf of $W_n \triangleq \sum_{i=1}^{n} Y_i^2$
is $\chi^2$ with $n$ degrees of freedom, that is,

$$f_W(w; n) = ([(n/2) - 1]!\,2^{n/2})^{-1} w^{(n/2)-1} \exp(-w/2),\quad w > 0.$$

6.12 We make $n$ i.i.d. observations $X_i$ on $X : N(\mu, \sigma^2)$ and construct $\hat{\mu} = n^{-1}\sum_{i=1}^{n} X_i$ and
$\hat{\sigma}^2 = (n-1)^{-1}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$. Show that $\hat{\mu}$ and $\hat{\sigma}^2$ are independent. (Hint: It helps
to use moment generating functions; if all else fails consult Appendix F).

6.13 Let $X : N(\mu, \sigma^2)$ and $W_n : \chi_n^2$ be two independent RVs. With $Y \triangleq (X - \mu)/\sigma$:
a) Show that the joint density of $Y$ and $W_n$ is given by

$$f_{YW_n}(y, w) = \frac{1}{\sqrt{2\pi}}\exp(-0.5y^2) \times \frac{w^{(n-2)/2}\exp(-0.5w)}{2^{n/2}\,[(n-2)/2]!},\quad -\infty < y < \infty,\ w > 0;$$

b) Let $T \triangleq Y/\sqrt{W_n/n}$; show that

$$f_T(t) = \frac{[(n-1)/2]!}{\sqrt{n\pi}\,[(n-2)/2]!} \times \frac{1}{[1 + t^2/n]^{(n+1)/2}},\quad -\infty < t < \infty.$$

This is the “Student’s” t-pdf. (Hint: Use a proper two variable-to-two variable
transformation.)

6.14 Let $W_m$ and $V_n$ be independent $\chi_m^2$, $\chi_n^2$ RVs, respectively. Then $F \triangleq \dfrac{W_m/m}{V_n/n}$ has the
F-distribution with $m, n$ degrees of freedom respectively. Show that $T^2$ has the
F-distribution with one and $n$ degrees of freedom.

6.15 Use Matlab™, Excel™, or some other scientific computing program to create a 95
percent confidence interval for the mean of a Normal random variable $X : N(0, 1)$.
Use 50 observations per single interval computation and repeat the experiment 50
times. For each experiment record the length of the interval and whether it includes
the mean, which in this case is zero. Repeat for 100 observations per interval
computation.

6.16 Show that the sample variance in Equation 6.2-3 is unbiased.

6.17 Show that the sample variance in Equation 6.2-3 is consistent.

6.18 Show that the one-shot estimation of $\sigma_X$ by Equation 6.3-7 has variance
$\mathrm{Var}(\hat{\sigma}) = \left(\frac{\pi}{2} - 1\right)\sigma_X^2 \approx 0.57\sigma_X^2$.

6.19 The following Normal data are observed:

 3.810e+0    2.550e+0   -1.150e+0   -1.230e+0    5.640e-1
-1.420e-2   -2.370e+0    2.500e+0    1.130e+0   -1.670e+0
-3.000e+0    1.730e+0    7.040e-1    9.680e-1    9.630e-1

Find a 95 percent confidence interval for the variance $\sigma_X^2$ of the distribution.

6.20 Show that the number $a$ in Equation 6.6-6 is $a = z_{(1+\delta)/2}$, $-a = z_{(1-\delta)/2}$, that is, $a$
is the $(1 + \delta)/2$ percentile of the $Z : N(0, 1)$.

6.21 Use the estimator in Equation 6.4-12 to estimate the variance of 20 Gaussian random
numbers drawn from a $N(0, 2)$ random number generator. Use the same data to
estimate the variance using the unbiased variance estimator function of Equation
6.4-2.

6.22 Let $X_1, X_2, X_3$ be three observations on $X : N(\mu_X, \sigma_X^2)$. Let $V_i = (X_i - \hat{\mu}_X(n))/\sigma_X$ for $i =
1, 2, 3$. Show that $\sum_{i=1}^{3} V_i^2$ is Chi-square with two degrees of freedom.

6.23 Using a Gaussian random number generator select 20 numbers from $X : N(0, 2)$ and
compute a 95 percent confidence interval for the variance. Get an approximate solution
by dividing the 5 percent error equally on either tail of the Chi-square pdf.

6.24 Use a Gaussian random number generator to generate 20 numbers from a $X : N(0, 1)$
distribution. Call these numbers $\{x_i, i = 1, \ldots, 20\}$. Repeat the process and obtain
a new set of 20 numbers and call these numbers $\{y_i, i = 1, \ldots, 20\}$. Create two sets
of numbers $\{w_i, i = 1, \ldots, 20\}$ and $\{v_i, i = 1, \ldots, 20\}$, where $w_i = x_i + (y_i/\sqrt{2})$ and
$v_i = x_i - (y_i/\sqrt{2})$. The underlying random variables $W, V$ would have a correlation
coefficient $\rho_{WV} = 1/3$ (prove this!). Compute the numerical sample correlation
coefficient $\hat{\rho}_{WV}$.

6.25 Show that the covariance estimating function of Equation 6.3-1 is unbiased and
consistent.

6.26 Find the mean and variance of the random variable $X$ driven by the geometric
probability mass function $P_X(n) = (1 - a)a^n u(n)$. Compute a 95 percent confidence
interval on the mean of $X$.

6.27 In Example 6.5-3 the claim is made that $P\left[-a \leq \dfrac{\hat{p} - p}{\sqrt{pq/n}} \leq a\right] = \delta$ is identical
with $P[(\hat{p} - p)^2 \leq a^2 pq/n] = \delta$. Justify this claim.

6.28 Show that the MLE for the mean enjoys the property of squared error consistency.

6.29 Compute the MLE for the parameter $\lambda$ in the exponential pdf. Show that the
likelihood function is indeed a maximum at the MLE value.

6.30 Compute the MLE for the parameter $p \triangleq P[\text{success}]$ in the binomial PMF.

6.31 Compute the MLE for the parameters $a, b$ ($a < b$) in $f_X(x) = (b - a)^{-1}(u(x - a) -
u(x - b))$.

6.32 [6-2] Consider the linear model $\mathbf{Y} = \mathbf{I}\mathbf{a} + b\mathbf{x} + \mathbf{V}$, where

$\mathbf{Y} \triangleq (Y_1, \ldots, Y_n)^T$
$\mathbf{V} \triangleq (V_1, \ldots, V_n)^T$
$\mathbf{I} \triangleq n \times n$ identity matrix
$\mathbf{x} \triangleq (x_1, \ldots, x_n)^T$
$\mathbf{a} \triangleq (a, \ldots, a)^T$

and $a, b$ are constants to be determined. Assume that the $\{V_i, i = 1, \ldots, n\}$ are $n$
i.i.d. Normal random variables as $N(0, \sigma^2)$, the $x_i$, $i = 1, \ldots, n$, are constant for each
$i = 1, \ldots, n$, but may vary as $i$ varies. They are called control variables.

(i) Show that $\{Y_i, i = 1, \ldots, n\}$ are $N(a + bx_i, \sigma^2)$;

(ii) Write the likelihood function and argue that it is maximized when $\sum_{i=1}^{n} (Y_i -
(a + bx_i))^2$ is minimized;

(iii) Show that the MLE of $a$ is $\hat{a}_{ML} = \hat{\mu}_Y - \hat{b}_{ML}\bar{x}$ and the MLE of $b$ is

$$\hat{b}_{ML} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})Y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where

$$\hat{\mu}_Y \triangleq (1/n)\sum_{i=1}^{n} Y_i, \qquad \bar{x} \triangleq (1/n)\sum_{i=1}^{n} x_i.$$

6.33 Assume that the weight of college football players follows the Normal distribution
with mean 220 lbs and standard deviation of 20 lbs. Determine the 95th percentile
of the weight random variable.

6.34 Compute the median for the geometrically distributed RV.

6.35 Compute the mean and median for the Chi-square random variable.

6.36 Show that an estimate for the 30th percentile, xo.3, is given by the interpolation 


formula Y4 + ae ~ £0.3, Where the {Y;, i = 1,...,10} are the ordered 


random variables formed from the set of unordered i.i.d. random variables $\{X_i, i =
1, \ldots, 10\}$.

6.37 How large a sample do we need to cover the 50th percentile with probability 0.99?
Hint: Use the formula $P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1} \binom{n}{i}(1/2)^n = 0.99$,
where $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$.

*6.38 Show that the joint pdf of the ordered random variables $Y_1, Y_n$, where $Y_1 \triangleq
\min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$, is given by

$$f_{Y_1Y_n}(y_1, y_n) = n(n - 1)\,(F_X(y_n) - F_X(y_1))^{n-2} f_X(y_1) f_X(y_n),\quad -\infty < y_1 < y_n < \infty.$$

Hint: Consider the joint pdf of all the $Y_1, Y_2, \ldots, Y_n$ and integrate out all but the
first and last.

*6.39 Let $\{Y_i, i = 1, \ldots, n\}$ be a set of ordered random variables. Define the range $R$ of
the set as $R \triangleq Y_n - Y_1$. Now consider six observations $\{X_i, i = 1, \ldots, 6\}$ on $X$
from the pdf $f_X(x) = u(x) - u(x - 1)$, where $u(x)$ is the unit-step function.
Show that $f_R(r) = 30r^4(1 - r)$, $0 < r < 1$. Hint: Use the result $f_{Y_1Y_n}(y_1, y_n) =
n(n - 1)\,(F_X(y_n) - F_X(y_1))^{n-2} f_X(y_1) f_X(y_n)$, $-\infty < y_1 < y_n < \infty$, and define two
random variables $R \triangleq Y_n - Y_1$, $S \triangleq Y_n$ and find $f_{RS}(r, s)$. Then integrate out with
respect to $S$.


Statistics: Part 2
Hypothesis Testing


Hypothesis testing is an important topic in the broader area of statistical decision theory. 
Statistical decision theory also includes other activities such as prediction, regression, game 
theory, statistical modeling, and many elements of signal processing. However, the ideas 
underlying hypothesis testing often serve as the basis for these other, typically more 
advanced, areas.†

Hypotheses take the following form: We make a hypothesis that a parameter has a 
certain value, or lies in a certain range, or that a certain event has taken place. The so-called 
alternative hypothesis‡ is that the parameter has a different value, or lies in a different range,
or that an event has not taken place. Then, based on real data, we either accept (reject) the 
hypothesis or accept (reject) the alternative. Parameter estimation and hypothesis testing 
are clearly related. For example, the decision to accept the hypothesis that the mean of one 
population is equal to the known mean of another population is essentially equivalent to 
estimating the mean of the unknown population and deciding that it is close enough to the 
given mean to deem them equal. 

In the real world we often are forced to make decisions when we don’t have all the facts,
or when our knowledge comes from observations that are inherently probabilistic. We all
(probably) know heavy smokers who live well into their eighties and beyond. Likewise, we
know of nonsmokers who die of lung cancer in their fifties. Does this mean that smoking
is unrelated to lung cancer? In days of old, the chiefs of tobacco companies said yes while
cancer epidemiologists said no. In view of all the evidence accumulated since then, no
reasonable person would now argue that smoking does not increase the likelihood of dying from
lung cancer. Nevertheless, unlike what happens when a person falls off a 20-story building
onto concrete, death by lung cancer or other smoking-related disease does not always follow
heavy smoking. The relationship between smoking and lung cancer remains essentially
probabilistic. In the following sections we discuss strategies for decision making in a probabilistic
environment.


†There are several textbook references for this material, for example [7-1] to [7-4].
‡The alternative hypothesis is often called, simply, the alternative. Thus, one encounters “we test the
hypothesis... versus the alternative....”


7.1 BAYESIAN DECISION THEORY 


In the absence of divine guidance, the Bayesian approach to making decisions in a random 
(stochastic) environment is, arguably, the most rational procedure devised by humans. 
Unfortunately, to use Bayes in its original form requires information we may not have 
with any accuracy or may be impossible to get. We illustrate the application of Bayesian 
theory and its concomitant weakness in the following example. 


Example 7.1-1 
(deciding whether to operate or not) Assume that you are a surgeon and that your patient’s
x-ray shows a nodule in his left lung. The patient is 40 years old, has no history of smoking,
and is otherwise in good health. Let us simplify the problem and assume that there are only
two possible states: (1) The nodule is an early onset cancer that without treatment will
spread and kill the patient and (2) the nodule is benign and doesn’t pose a health risk. We
shall abbreviate the former by the symbol $\zeta_1$ and the latter by $\zeta_2$. The reader will recognize
that the outcome space (read sample space) $\Omega$ has only the two points, that is, $\Omega = \{\zeta_1, \zeta_2\}$,
but, in more complex situations, could in fact have many more. The surgeon’s job is to
make the decision (and take the subsequent action) that is best for the patient. The trouble
is that without an operation the surgeon doesn’t know the state of nature, that is, whether
$\zeta_1$ or $\zeta_2$ is the case. There are two terminal actions: operate ($a_1$) or don’t operate ($a_2$).

It is not always clear as to what “best” means. However, it seems quite reasonable, other 
things being equal, that “best” in this case is that decision/action that will minimize the 
number of years that the patient will lose from a normal lifetime. There are four situations 
to consider: 


(1) The surgeon decides not to operate and the nodule is benign; 
(2) The surgeon decides not to operate and the nodule is a cancer; 
(3) The surgeon operates and the nodule is benign; 
(4) The surgeon operates and the nodule is a cancer. 
Prior data exist that lung nodules discovered in nonsmoking, early middle-age males are 
benign 70 percent of the time. Thus, the probability that a nodule is cancerous for this 
group is only 0.3. The surgeon is also aware of the data in Table 7.1-1. 

Table 7.1-1

If the decision is          | And the state of nature is | Then the number of years subtracted from a normal life span is $l(a, \zeta)$
Don’t operate (action $a_2$) | Benign lesion ($\zeta_2$)   | $l(a_2, \zeta_2) = 0$
Don’t operate (action $a_2$) | Cancer ($\zeta_1$)          | $l(a_2, \zeta_1) = 35$
Operate (action $a_1$)       | Benign lesion ($\zeta_2$)   | $l(a_1, \zeta_2) = 5$
Operate (action $a_1$)       | Cancer ($\zeta_1$)          | $l(a_1, \zeta_1) = 5$

The terms $\{l(a_i, \zeta_j),\ i = 1, 2;\ j = 1, 2\}$ are called loss functions, and $l(a_i, \zeta_j)$ is the
loss associated with taking action $a_i$ when the state of nature is $\zeta_j$. The reader might
ask why $l(a_1, \zeta_1) = l(a_1, \zeta_2) = 5$ and not zero. Surgeons know that operations are risky
procedures and that even healthy patients can suffer from post-operative infections such
as from MRSA† and gram-negative bacteria‡. Unless absolutely necessary, most surgeons
will avoid major invasive surgery in preference to non-invasive procedures. Thus, due to
infections and other complications any surgery carries a risk and, counting the people who
die from surgical complications, we assign an average loss of five years.

Next, we introduce the idea of a decision function $d$. The decision function $d$ is a
function of observable data so we write $d(X_1, X_2, \ldots, X_n)$, where the $\{X_i, i = 1, \ldots, n\}$ are
$n$ i.i.d. observations on a random variable (RV) $X$. The decision function $d(X_1, X_2, \ldots, X_n)$
helps to guide the surgeon with respect to what action, that is, $a_1$ or $a_2$, to take. In our
example we limit ourselves to a single observation that we denote $X$, specifically the ratio
of the square of the length of the boundary of the nodule to the enclosed area. This is a
measure of the irregularity of the edges of the nodule: The more irregular the edges, the
more likely that the nodule is a cancerous lesion (Figure 7.1-1). Thus, we expect that most
of the time the RV $X$ for the cancerous lesion ($\zeta_1$) will yield larger realizations than those
yielded by $X$ for the benign case ($\zeta_2$). A realization of $X$ in this case is the datum.


Figure 7.1-1 (a) A benign lesion tends to have regular edges; (b) a cancerous lesion tends to have
irregular edges.


†Methicillin-resistant Staphylococcus aureus.
‡These bugs prevail in hospitals and cause infections that are difficult to treat.


Figure 7.1-2 [Plot of the pdf’s $f(x; \zeta_2)$ and $f(x; \zeta_1)$ versus $x$, with the threshold $c$ marked on the
$x$-axis.] There is a value of $c$ (to be determined) that will minimize the expected risk. A datum
point in the region $\Gamma_2 = (-\infty, c]$ is more likely to be associated with a benign condition and will lead
to action $a_2$ (don’t operate), while a datum point in $\Gamma_1 = [c, \infty)$ is more likely to be associated with a
cancer and will lead to action $a_1$ (operate).


Let $f(x; \zeta_1)$ and $f(x; \zeta_2)$ denote the pdf’s of $X$ under conditions $\zeta_1$ and $\zeta_2$, respectively
(see Figure 7.1-2). In this example we assume, for simplicity and ease of visualization, that
these pdf’s are unimodal and are continuous. Further, as shown in Figure 7.1-2, we assume
that there exists a constant $c$ such that if the datum falls to the right of $c$ it will be taken as
evidence that the opacity is a cancer. Likewise, if the datum falls to the left of $c$ it will be
taken as evidence that cancer is not present. If the evidence suggests a cancer then action
$a_1$ follows; else, action $a_2$ follows. Since this is a probabilistic environment, errors will be
made. Thus,

$$P[a_1|\zeta_2] = \int_c^{\infty} f(x; \zeta_2)\,dx$$  (7.1-1)

is the error probability that the evidence suggests there is a cancer that requires an operation
when in fact there is no cancer. Likewise

$$P[a_2|\zeta_1] = \int_{-\infty}^{c} f(x; \zeta_1)\,dx$$  (7.1-2)

is the error probability that the evidence suggests there is no cancer and therefore the action
is not to operate while in fact there is a cancer.

The conditional expectation of the loss when the state of nature is $\zeta$ and the decision
rule is $d$ is called the risk $R(d; \zeta)$. Thus,

$$R(d; \zeta_1) = l(a_1; \zeta_1)P[a_1|\zeta_1] + l(a_2; \zeta_1)P[a_2|\zeta_1]$$
and  (7.1-3)
$$R(d; \zeta_2) = l(a_1; \zeta_2)P[a_1|\zeta_2] + l(a_2; \zeta_2)P[a_2|\zeta_2].$$


Finally, the expected risk, labeled $B(d)$† and defined as

$$B(d) \triangleq R(d; \zeta_1)P[\zeta = \zeta_1] + R(d; \zeta_2)P[\zeta = \zeta_2],$$  (7.1-4)

is the quantity to be minimized. A decision function $d^*$ that minimizes $B(d)$ is a Bayes
strategy. Thus,

$$B(d^*) = \min_d \{R(d; \zeta_1)P[\zeta = \zeta_1] + R(d; \zeta_2)P[\zeta = \zeta_2]\}.$$  (7.1-5)

The probabilities

$$P_1 \triangleq P[\zeta = \zeta_1], \qquad P_2 \triangleq P[\zeta = \zeta_2]$$

are called the a priori probabilities of the state of nature. In terms of the symbols introduced
above, we can write $B(d)$ as

$$B(d) = P_1\,l(a_2, \zeta_1) + P_2\,l(a_2, \zeta_2) + \int_c^{\infty} \{P_2 f(x; \zeta_2)[l(a_1, \zeta_2) - l(a_2, \zeta_2)] - P_1 f(x; \zeta_1)[l(a_2, \zeta_1) - l(a_1, \zeta_1)]\}\,dx,$$  (7.1-6)

where we choose $c$ to minimize $B(d)$. Where the integrand in the expression for $B(d)$ is positive,
it will add to $B(d)$, but where the integrand is negative, it will reduce $B(d)$. Indeed if we choose
$c$, say $c = c^*$, so that $(c^*, \infty)$ leaves out all the points where the integrand is positive but
includes the points where the integrand is negative, then we have minimized $B(d)$. Outcomes
(read events) that make the integrand negative are described by

$$\frac{f(X; \zeta_1)}{f(X; \zeta_2)} > \frac{[l(a_1, \zeta_2) - l(a_2, \zeta_2)]P_2}{[l(a_2, \zeta_1) - l(a_1, \zeta_1)]P_1},$$  (7.1-7)

which is the Bayes decision rule. It says that for all outcomes‡ in $(c^*, \infty)$ take action $a_1$
(operate). Likewise for all outcomes in $(-\infty, c^*)$, that is,

$$\frac{f(X; \zeta_1)}{f(X; \zeta_2)} < \frac{[l(a_1, \zeta_2) - l(a_2, \zeta_2)]P_2}{[l(a_2, \zeta_1) - l(a_1, \zeta_1)]P_1},$$

take action $a_2$ (don’t operate). The constant $c^*$ is the point that satisfies

$$\frac{f(c^*; \zeta_1)}{f(c^*; \zeta_2)} = \frac{[l(a_1, \zeta_2) - l(a_2, \zeta_2)]P_2}{[l(a_2, \zeta_1) - l(a_1, \zeta_1)]P_1}.$$

The prior probabilities in this example would be computed from aggregate information on
thousands of patients who sought help for similar symptoms. The nodule observed in a
40-year-old, nonsmoking male is more likely to be benign than cancerous; for example, it might be
a harmless opacity, some residual scar tissue, or even the intersection of blood vessels giving
the appearance of a lesion. For simplicity we shall assume that we know these probabilities
as $P_1 = 0.3$, $P_2 = 0.7$, in agreement with the prior data quoted above. Then specializing
Equation 7.1-6 for this case yields

$$B(d) = 10.5 + \int_c^{\infty} [3.5 f(x; \zeta_2) - 9 f(x; \zeta_1)]\,dx,$$

which implicitly yields the constant $c^*$ from $3.5 f(c^*; \zeta_2) - 9 f(c^*; \zeta_1) = 0$. Then the Bayes
decision rule is

$f(X; \zeta_1)/f(X; \zeta_2) > 0.39$ → operate

$f(X; \zeta_1)/f(X; \zeta_2) < 0.39$ → don’t operate.


†The symbol $B$ is used in honor of the mathematician/philosopher Thomas Bayes (1702-1761).
‡Recall that under the mapping of the (real) RV $X$, events are intervals on the real line.
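
To make the arithmetic of Example 7.1-1 concrete, the sketch below computes the Bayes threshold from the losses in Table 7.1-1 and the priors $P[\text{cancer}] = 0.3$, $P[\text{benign}] = 0.7$, and then evaluates the expected risk as a function of the cutoff $c$. The text does not specify the pdf’s of $X$; purely for illustration we assume unit-variance Gaussian pdf’s with mean 3 under $\zeta_1$ (cancer) and mean 1 under $\zeta_2$ (benign). Those means, and all function names, are hypothetical.

```python
import math

# Losses from Table 7.1-1, keyed by (action, state): z1 = cancer, z2 = benign.
LOSS = {("a1", "z1"): 5.0, ("a1", "z2"): 5.0,
        ("a2", "z1"): 35.0, ("a2", "z2"): 0.0}
P1, P2 = 0.3, 0.7           # priors: P[cancer], P[benign]
M1, M2, SD = 3.0, 1.0, 1.0  # ASSUMED Gaussian means and common std. deviation


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def normal_tail(c, mean, sd):
    """P[X > c] for a Gaussian RV."""
    return 0.5 * math.erfc((c - mean) / (sd * math.sqrt(2.0)))


def bayes_threshold():
    """Right-hand side of the Bayes decision rule."""
    num = (LOSS[("a1", "z2")] - LOSS[("a2", "z2")]) * P2
    den = (LOSS[("a2", "z1")] - LOSS[("a1", "z1")]) * P1
    return num / den


def decide(x):
    """Operate when the likelihood ratio exceeds the Bayes threshold."""
    ratio = normal_pdf(x, M1, SD) / normal_pdf(x, M2, SD)
    return "operate" if ratio > bayes_threshold() else "don't operate"


def expected_risk(c):
    """B(d) of Equation 7.1-6 when action a1 is taken for data above c."""
    base = P1 * LOSS[("a2", "z1")] + P2 * LOSS[("a2", "z2")]
    add = P2 * (LOSS[("a1", "z2")] - LOSS[("a2", "z2")]) * normal_tail(c, M2, SD)
    sub = P1 * (LOSS[("a2", "z1")] - LOSS[("a1", "z1")]) * normal_tail(c, M1, SD)
    return base + add - sub


k_b = bayes_threshold()  # 3.5 / 9, about 0.39
# For equal-variance Gaussians the likelihood ratio equals k_b at this cutoff.
c_star = 0.5 * (M1 + M2) + SD ** 2 * math.log(k_b) / (M1 - M2)
print(k_b, c_star, expected_risk(c_star))
print(decide(3.0), "/", decide(0.5))
```

Under these assumed pdf’s the likelihood ratio is monotone in $x$, so thresholding the ratio at the Bayes threshold is the same as thresholding the datum at the cutoff computed above; with other pdf’s the decision regions need not be single intervals.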


In Example 7.1-1 only a single RV was used in making the decision. In many problems,
however, a decision will be based on observing many i.i.d. RVs. In that case the Bayes
decision rule takes the form

$$\frac{f(X_1; \zeta_1) \cdots f(X_n; \zeta_1)}{f(X_1; \zeta_2) \cdots f(X_n; \zeta_2)} > \frac{[l(a_1, \zeta_2) - l(a_2, \zeta_2)]P_2}{[l(a_2, \zeta_1) - l(a_1, \zeta_1)]P_1} \triangleq k_b,$$ accept $\zeta_1$ as state of nature,

$$\frac{f(X_1; \zeta_1) \cdots f(X_n; \zeta_1)}{f(X_1; \zeta_2) \cdots f(X_n; \zeta_2)} < \frac{[l(a_1, \zeta_2) - l(a_2, \zeta_2)]P_2}{[l(a_2, \zeta_1) - l(a_1, \zeta_1)]P_1} \triangleq k_b,$$ reject $\zeta_1$ as state of nature.
(7.1-8)

The reader will recognize that the numerator and denominator in Equation 7.1-8 are the
likelihood functions $L(\zeta_j) = \prod_{i=1}^{n} f_X(X_i; \zeta_j)$, $j = 1, 2$, discussed in Chapter 6. Therefore
Equation 7.1-8, being a ratio of two likelihood functions (the likelihood ratio) that is being
compared to a constant, is quite appropriately called a likelihood ratio test (LRT). The
constant $k_b$ in Equation 7.1-8 is called the Bayes threshold.
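
A likelihood ratio test on $n$ i.i.d. observations can be sketched as follows. The product of pdf’s in Equation 7.1-8 is evaluated as a difference of log-likelihoods, a common implementation choice (not something the text prescribes) that avoids numerical underflow for large $n$. The Gaussian pdf’s, the sample values, and the function names are hypothetical; the threshold $3.5/9$ is the Bayes threshold of Example 7.1-1.

```python
import math


def gaussian_pdf(mean, sd):
    """Return a Gaussian pdf as a function of x (an assumed model, for illustration)."""
    def pdf(x):
        return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return pdf


def log_likelihood(samples, pdf):
    """Logarithm of the likelihood function: sum of log f(x_i)."""
    return sum(math.log(pdf(x)) for x in samples)


def likelihood_ratio_test(samples, pdf1, pdf2, threshold):
    """True means: accept state 1 (likelihood ratio exceeds the threshold)."""
    log_ratio = log_likelihood(samples, pdf1) - log_likelihood(samples, pdf2)
    return log_ratio > math.log(threshold)


pdf_z1 = gaussian_pdf(3.0, 1.0)  # ASSUMED pdf under state zeta_1
pdf_z2 = gaussian_pdf(1.0, 1.0)  # ASSUMED pdf under state zeta_2
k_b = 3.5 / 9.0                  # Bayes threshold from Example 7.1-1

accept_high = likelihood_ratio_test([2.6, 3.1, 2.2], pdf_z1, pdf_z2, k_b)
accept_low = likelihood_ratio_test([0.8, 1.4, 0.5], pdf_z1, pdf_z2, k_b)
print(accept_high, accept_low)  # True False
```

Replacing `k_b` by any other positive constant gives a likelihood ratio test that is no longer tied to the Bayes criterion.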

Every Bayes strategy leads to an LRT but not every LRT is the result of a Bayes 
strategy. The Bayes strategy seeks to minimize the average risk but other LR-type tests 
may seek to abide by different criteria, for example, maximizing the LR subject to a given 
probability of error. One problem with implementing the Bayes strategy is that the a priori 
probabilities P, and P2 are often not known. Another problem is that it may be difficult to 
assign a reasonable “loss” to a particular action. For example, say that you are preparing a 
large omelet and need to break a dozen eggs. You are thinking of using a Bayes strategy to 
minimize the loss, that is, the amount of work you have to do. Your choices are to use one 
bowl or two bowls and the random element here is whether an egg is good or bad. Suppose 
that, on average, for every 100 good eggs there is one bad egg. If you use only one bowl 
and a bad egg is added to the others before you realize that it is bad, then you have ruined 
the whole mixture. If you use two bowls, a small one in which you inspect the contents of 
a newly broken egg before adding it to the other eggs, and a large one containing all the 
(good) broken eggs, then you avoid ruining the mixture if the egg is bad. Now, however, you 
have two bowls to wash instead of one when you are finished. How would you reasonably 
define the loss in this case? While this example is perhaps not terribly serious, it illustrates 
one of the problems associated with trying to apply the Bayes strategy. Another problem is 
that it may be difficult to estimate prior probabilities for rare events. For example, suppose 
a country wants to use its antimissile resources against an attack by a hostile neighbor. 
If the defense strategy is designed according to a Bayes criterion, knowledge of the prior 
probability of an enemy attack is needed. How would one estimate this in a reasonable way?

7.2 LIKELIHOOD RATIO TEST


Because prior probabilities are often not available and loss functions may not be easily 
defined, we drop the constraint on minimizing the expected risk and modify the Bayes 
decision rule as 


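$$\frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} \begin{cases} > k, & \text{accept } \zeta_1 \text{ as state of nature} \\ < k, & \text{reject } \zeta_1 \text{ as state of nature,} \end{cases} \qquad (7.2\text{-}1)$$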


where the threshold value k is determined from criteria possibly other than that of Bayes. 
Common criteria are related to the probabilities of rejecting a claim when the claim is 
true and/or accepting the counterclaim when the counterclaim is true. This kind of test is 
known as a likelihood ratio test that tests a simple hypothesis (the claim) against a simple 
alternative (the counterclaim). To save on notation we define the LRT random variable as 


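$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} = L(\zeta_1)/L(\zeta_2) \qquad (7.2\text{-}2)$$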


and illustrate its application in an example. 


Example 7.2-1 
(testing a claim for a food) Consider a health-food manufacturer who claims to have devel- 
oped a snack bar for kids that will reduce childhood obesity. The snack bar, while tasty, 
supposedly acts as an appetite suppressant and thereby helps reduce the desire for fattening 
in-between-meals snacks such as potato chips, hamburgers, sugar-sweetened soda, chocolate 
bars, etc. To test the validity of this claim, we take n children (a subset of a large, well- 
defined group) and give them the weight-controlling snack bar. After one month, the average 
weight for this group is 98 lbs with a standard deviation of 5 lbs. The other children in the 
group, that is, the ones not taking the weight-controlling snack bar, average 102 lbs with 
a standard deviation of 5 lbs. We make the hypothesis‡ that the weight-controlling snack
bar has no effect in controlling obesity. This is called the null hypothesis and is denoted
by $H_1$. The alternative, denoted by $H_2$, is that the weight-controlling snack is helpful in
controlling obesity.§ It does not matter which hypothesis we designate as $H_1$ but once the


†Obesity among children is a severe problem in the United States. Extrapolated from the present rate
of caloric consumption, it is predicted that in 2020 three out of four Americans will be overweight or obese
(Consumer Reports, December 2010, p. 11).

‡The meaning of this word is: an assumption provisionally accepted but currently not rigorously
supported by evidence. A hypothesis is rejected if subsequent information doesn't support it.

§In many books the null hypothesis is denoted by $H_0$ and the alternative is denoted by $H_a$. We prefer
the numerical subscript notation.

choice is made, we are required to be consistent throughout the problem. In the absence of
a well-defined loss function, we focus instead on the probabilities of error. We define

$$\alpha \triangleq P[\text{based on our test we decide that } H_2 \text{ is true} \mid H_1 \text{ is true}]$$

$$\beta \triangleq P[\text{based on our test we decide that } H_1 \text{ is true} \mid H_2 \text{ is true}].$$


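With $X_i$, $i = 1,\ldots,n$, being i.i.d. RVs denoting the weights of the $n$ children, we assume
that the weights of both groups are Normally distributed near their means,† that is,

$$f_X(x_i;H_1) = \frac{1}{\sqrt{2\pi \cdot 25}} \exp\left(-\frac{1}{2}\left[\frac{x_i - 102}{5}\right]^2\right), \qquad f_X(x_i;H_2) = \frac{1}{\sqrt{2\pi \cdot 25}} \exp\left(-\frac{1}{2}\left[\frac{x_i - 98}{5}\right]^2\right).$$

We note that $f_X(x_i;H_1)$ ($f_X(x_i;H_2)$) is the pdf of $X_i$, $i = 1,\ldots,n$, under the condition
that $H_1$ ($H_2$) applies.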
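Then from Equation 7.2-2,

$$\Lambda = \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left(\left[\frac{X_i - 102}{5}\right]^2 - \left[\frac{X_i - 98}{5}\right]^2\right)\right).$$

Further simplification yields

$$\Lambda = K_n \exp\left(\frac{4n}{25}\hat\mu_X(n)\right),$$

where $\hat\mu_X(n) \triangleq n^{-1}\sum_{i=1}^{n} X_i$ and $K_n$ is a constant independent of the $\{X_i, i = 1,\ldots,n\}$ but
dependent on the sample size $n$. The decision function then becomes

$$\text{if } K_n \exp\left(\frac{4n}{25}\hat\mu_X(n)\right) > k_n, \text{ accept } H_1 \text{ (reject } H_2)$$

$$\text{if } K_n \exp\left(\frac{4n}{25}\hat\mu_X(n)\right) < k_n, \text{ accept } H_2 \text{ (reject } H_1).$$

Since the natural logarithm of $\Lambda$, $\ln\Lambda$, is an increasing function of $\Lambda$ (Figure 7.2-1), we can
simplify the decision function using (natural) logs and aggregating various constants into a
single one. Then the test becomes

$$\text{if } \hat\mu_X(n) > c_n, \text{ accept } H_1$$

$$\text{if } \hat\mu_X(n) < c_n, \text{ accept } H_2,$$

where $c_n$ is another constant that depends on the number of children in the test $n$, and is
determined by the criterion we impose. If $H_1$ is true then $\hat\mu_X(n)$ is $N(102, 25/n)$, that is,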


†The Normal characteristic is taken to be valid around the center of the pdf, say, a few $\sigma$ values on either
side of the mean. It definitely is not valid in the tails. For example, what would you make of a “negative
weight”?

Figure 7.2-1 The natural logarithm of x is an increasing function of x. (Plot of ln(x) versus x.)

$$f_{\hat\mu}(x;H_1) = \frac{1}{\sqrt{2\pi (25/n)}} \exp\left(-\frac{1}{2}\left[\frac{x - 102}{5/\sqrt{n}}\right]^2\right),$$

while if $H_2$ is true $\hat\mu_X(n)$ is $N(98, 25/n)$, that is,

$$f_{\hat\mu}(x;H_2) = \frac{1}{\sqrt{2\pi (25/n)}} \exp\left(-\frac{1}{2}\left[\frac{x - 98}{5/\sqrt{n}}\right]^2\right).$$

The pdf's of $\hat\mu_X(n)$ under $H_1$ and $H_2$ are shown in Figure 7.2-2.


Suppose by way of a criterion we specify $\alpha = 0.025$. Recall that $\alpha \triangleq P[\text{accept that } H_2 \text{ is true} \mid H_1 \text{ is true}]$. Then

$$0.025 = \int_{-\infty}^{c_n} f_{\hat\mu}(x;H_1)\,dx = F_{SN}\left(\frac{c_n - 102}{5/\sqrt{n}}\right) = F_{SN}(z_{0.025}),$$

which, from the Normal tables and simplifying, gives a threshold value $c_n = 102 - (9.8/\sqrt{n})$.
As elsewhere the symbol $F_{SN}(z)$ stands for the CDF of the standard Normal RV evaluated
at $z$. Thus, acceptance of $H_1$ requires the event $\{102 - (9.8/\sqrt{n}) < \hat\mu_X(n) < \infty\}$. This
can also be written as $(102 - (9.8/\sqrt{n}), \infty)$ since intervals on the real line are events under
RV mappings. The influence of the sample size on the threshold is shown in Figure 7.2-3.
The power of the test increases with increasing sample size, as shown in Figure 7.2-4.
Increasing power means that the probability of making an error when $H_2$ is true decreases.


The error probability $\alpha$ is called the probability of a type I error and the significance
level† of the test. The probability $P \triangleq 1 - \beta$ is called the power of the test and $\beta$ itself is
called the probability of a type II error. The power of the test is the probability that we
reject the null hypothesis given that the alternative is true. In general, it is not possible to
make both $\alpha$ and $\beta$ extremely small even though it is not true, in general, that $\alpha + \beta = 1$.

†The error probability $\alpha$ is sometimes called the size of the test.

Figure 7.2-2 The pdf's $f(x, H_1)$ and $f(x, H_2)$ for Example 7.2-1.

Figure 7.2-3 As the sample size increases, the threshold value moves to the right. (Plot title: Threshold versus sample size for $\alpha = 0.025$.)

With reference to Example 7.2-1 we address a question some readers might have 
regarding this discussion, namely since the children eating the weight-controlling snack bar 
average 4 lbs less weight than their counterparts, why not simply accept this as evidence 
that the weight-controlling snack bar works? This would ignore the fact that even in the 
heavier group of children, a weight of 98 lbs is within one standard deviation from the mean 
of 102 lbs, meaning that if the sample size is small we could be in error in concluding that 
the snack bar is useful. Moreover, such a naive approach would tell us nothing about the 
probability that we are mistaken. 
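The dependence of the threshold and the power on the sample size in Example 7.2-1 can be tabulated with a short Python sketch. It assumes only the values stated in the example (means 102 and 98, standard deviation 5, $\alpha = 0.025$); the function name and the chosen sample sizes are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal RV, mean 0 and sigma 1

def threshold_and_power(n, mean_h1=102.0, mean_h2=98.0, sigma=5.0, alpha=0.025):
    """Return (c_n, power) for the test 'accept H1 if the sample mean exceeds c_n'."""
    se = sigma / sqrt(n)                               # std. deviation of the sample mean
    c_n = mean_h1 + STD_NORMAL.inv_cdf(alpha) * se     # alpha = P[mean < c_n | H1]
    power = STD_NORMAL.cdf((c_n - mean_h2) / se)       # P[mean < c_n | H2]
    return c_n, power

for n in (1, 4, 9, 16, 25):
    c_n, power = threshold_and_power(n)
    print(f"n = {n:2d}  c_n = {c_n:7.3f}  power = {power:.4f}")
```

As n grows, the threshold moves right toward 102 and the power approaches one, which is the behavior the captions of Figures 7.2-3 and 7.2-4 describe.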


400 Chapter 7 Statistics: Part 2 Hypothesis Testing

Figure 7.2-4 The power of the test increases with increasing sample size, which is a good thing. The
best test would maximize the power of the test for a given $n$ and $\alpha$. In this example, the test is indeed
the best test. (Plot title: Power of LRT for $\alpha = 0.025$.)

Example 7.2-2
(difference of means of Normal populations) In some nutritional circles, there is a belief that
bringing aid to Third-World malnourished children by way of a diet rich in omega-3 oils (e.g.,
fish) and complex carbohydrates (whole wheat, bran, brown rice, etc.) can increase a child’s
IQ by 10 points by age 13, besides improving health. To test such a claim, one might want to
measure the IQ of children brought up on such a diet against the IQ of children brought up on
the local diet. Typically the data would be the sample mean $\hat\mu(n)$ of the IQs of the $n$ children
fed the experimental diet. If we denote the true but unknown mean by $\mu_{IQ}$ the test might take
the form $H_1: \mu_{IQ} = 110$ versus $H_2: \mu_{IQ} = 100$. There are several variations on this type of
test, for example, $H_1: \mu = a$ versus $H_2: \mu \ne a$ and $H_1: a < \mu < b$ versus $H_2: \mu < a, \mu > b$.
We consider the elementary test $H_1: \mu = b$ versus $H_2: \mu = a$ $(b > a)$ for a Normal
population with, say, variance $\sigma^2$. We assume a random sample of size $n$, meaning that we
have $n$ i.i.d. RVs $X_1,\ldots,X_n$. Then if $H_1$ is true $X_i: N(b, \sigma^2)$ while if $H_2$ is true $X_i: N(a, \sigma^2)$.
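The LRT random variable is

$$\Lambda = \frac{\prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2}\left[\frac{X_i - b}{\sigma}\right]^2\right)}{\prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2}\left[\frac{X_i - a}{\sigma}\right]^2\right)}, \qquad (7.2\text{-}3)$$

which, after simplifying, taking logs, and aggregating constants, yields the test

$$\hat\mu(n) > c_n, \text{ accept } H_1 \text{ (reject } H_2)$$

$$\hat\mu(n) < c_n, \text{ reject } H_1 \text{ (accept } H_2).$$

The constant $c_n$ is determined by our choice of $\alpha$. For example with $\alpha = 0.025$, we must
solve

$$\begin{aligned} \alpha = P[\text{accept } H_2 \mid H_1 \text{ true}] = 0.025 &= \int_{-\infty}^{c_n} (2\pi\sigma^2/n)^{-1/2} \exp\left(-\frac{1}{2}\left[\frac{x - b}{\sigma/\sqrt{n}}\right]^2\right) dx \\ &= \int_{-\infty}^{c_n'} (2\pi)^{-1/2} \exp\left(-\frac{1}{2}y^2\right) dy = F_{SN}(c_n') = F_{SN}(z_{0.025}), \end{aligned}$$

where $c_n' \triangleq (c_n - b)\sqrt{n}/\sigma$. From the tables of the standard Normal CDF, we find that
$z_{0.025} = -1.96$. Solving, we get $c_n = b - (1.96\sigma/\sqrt{n})$. Notice the similarity between this
example and Example 7.2-1. The power of the test is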


$$\begin{aligned} P &= 1 - P[\text{accept } H_1 \mid H_2 \text{ true}] \\ &= 1 - (2\pi\sigma^2/n)^{-1/2} \int_{c_n}^{\infty} \exp\left(-\frac{1}{2}\left[\frac{z - a}{\sigma/\sqrt{n}}\right]^2\right) dz \qquad (7.2\text{-}4) \\ &= F_{SN}\left(\frac{c_n - a}{\sigma/\sqrt{n}}\right). \end{aligned}$$

The reader will recognize that the power of the test is simply the probability of accepting
$H_2$ when $H_2$ is true. Returning to the IQ problem that motivated this discussion, we find
that for $\alpha = 0.025$, $b = 110$, $a = 100$, $\sigma = 10$, and $n = 25$, the acceptance region for $H_1$ is
the region to the right of $c_n = 106.1$. In other words, when the event $\{106.1 < \hat\mu(n) < \infty\}$
occurs, it suggests that a good diet helps to overcome the IQ deficiency of malnourished
children. The power of the test is 0.999.
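The quoted type I error and power for the IQ example can be checked empirically. The sketch below is a Monte Carlo simulation, which is an assumed technique for illustration and not the analytical method of the text; the function name, trial count, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulate_error_rates(b=110.0, a=100.0, sigma=10.0, n=25, alpha=0.025,
                         trials=5000, seed=1):
    """Estimate the type I error and the power of the test 'reject H1 if mean < c_n'."""
    rng = random.Random(seed)
    c_n = b + NormalDist().inv_cdf(alpha) * sigma / sqrt(n)

    def sample_mean(true_mean):
        # Sample mean of n i.i.d. Normal observations with the given true mean.
        return fmean(rng.gauss(true_mean, sigma) for _ in range(n))

    # Fraction of experiments that reject H1 although H1 (mean b) is true.
    type1 = sum(sample_mean(b) < c_n for _ in range(trials)) / trials
    # Fraction of experiments that reject H1 when H2 (mean a) is true.
    power = sum(sample_mean(a) < c_n for _ in range(trials)) / trials
    return c_n, type1, power

c_n, type1, power = simulate_error_rates()
print(f"c_n = {c_n:.2f}, estimated alpha = {type1:.3f}, estimated power = {power:.3f}")
```

The estimates fluctuate with the seed and the number of trials, but they should settle near the analytical values of 0.025 and 0.999 as the trial count grows.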


Neyman-Pearson Theorem. Suppose we are asked to find a test for a simple hypothesis
versus a simple alternative that, for a given $\alpha$, will minimize $\beta$. Such a test will maximize
the power $P = 1 - \beta$ and is therefore a most powerful test. What is this test? The Neyman-Pearson
theorem (given here without proof) furnishes the answer.

Theorem 7.2-1 Denote the set of points in the critical region by $R_k$ (i.e., the region
of outcomes where we reject the hypothesis $H_1$). Denote the significance of the test as $\alpha$,
meaning $P[\text{accept } H_2 \mid H_1 \text{ is true}] \le \alpha$. Then $R_k$ maximizes the power of the test $P \triangleq 1 - \beta$
if it satisfies

$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} < k \qquad (7.2\text{-}5)$$

for some fixed number $k$, which determines $R_k$.

Discussion. The Neyman-Pearson Theorem (NPT) says that the likelihood ratio test,
subject to the above constraints, that is, at significance $\alpha$, is the most powerful test. In this
sense it is an optimal test. The relationship between $R_k$, $k$, and $\alpha$ is not explicitly stated
by the theorem but becomes clear in working a problem.


Example 7.2-3 
(chicken feed for making large eggs) A producer of chicken feed claims that a new product 
“Eggrow,” when fed to chickens, will cause the laid eggs to be larger than those laid by 
chickens fed ordinary feed. With ordinary feed, the chickens raised by this producer lay 
eggs that on the average weigh 60 grams per egg, with a standard deviation of 4 grams. 
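Twenty-five chickens fed on “Eggrow” produce eggs whose average weight is 62 grams with a
standard deviation of 4 grams. Let the hypothesis be $H_1: \mu = \mu_1 = 62$ and the alternative
be $H_2: \mu = \mu_2 = 60$. The significance level of the test is 0.05. According to the NPT,
the test

$$\Lambda = \frac{\prod_{i=1}^{25} (2\pi 16)^{-1/2} \exp\left(-\frac{1}{2}\left[\frac{X_i - 62}{4}\right]^2\right)}{\prod_{i=1}^{25} (2\pi 16)^{-1/2} \exp\left(-\frac{1}{2}\left[\frac{X_i - 60}{4}\right]^2\right)} < k$$

that defines the critical region $R_k$ is the most powerful test. Then
$\Lambda = \exp\left(\frac{25\hat\mu}{8} + \frac{25}{32}\left[(60)^2 - (62)^2\right]\right)$ and taking logs, aggregating constants, and simplifying, yields
the test

$$\text{if } \hat\mu < c_n, \text{ reject } H_1, \text{ accept } H_2$$

$$\text{if } \hat\mu > c_n, \text{ accept } H_1, \text{ reject } H_2,$$

where $c_n$ is an unknown constant. To find $c_n$, and the rejection region $R_k$, we solve

$$0.05 = \int_{-\infty}^{c_n} \frac{1}{\sqrt{2\pi (16/25)}} \exp\left(-\frac{1}{2}\left[\frac{x - 62}{4/5}\right]^2\right) dx$$

and find that $c_n = 60.7$ and $R_k = (0, 60.7)$. Thus, if $\hat\mu < 60.7$, reject $H_1$, accept $H_2$. The
test is most powerful and $P \approx 0.81$.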
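As a check on these numbers, here is a short worked computation. It assumes the standard Normal percentile $z_{0.05} \approx -1.645$ (a table value not quoted in the example) and uses the standard deviation of the sample mean, $4/\sqrt{25} = 0.8$:

$$c_n = 62 + z_{0.05}(0.8) \approx 62 - 1.316 = 60.68 \approx 60.7,$$

$$P = P[\hat\mu < c_n \mid H_2] = F_{SN}\left(\frac{60.7 - 60}{0.8}\right) = F_{SN}(0.875) \approx 0.81.$$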


7.3 COMPOSITE HYPOTHESES 


In the previous section we mentioned that in practice there are tests of the form: $H_1: a < \mu < b$
versus $H_2: \mu < a, \mu > b$ and others. All of these tests have one thing in common:
either $H_1$ or $H_2$ or both deal with events whose sample space has many outcomes. In the
case of the simple hypothesis versus the simple alternative, the sample space had only two
points $\zeta_1$ and $\zeta_2$. In the case of composite hypotheses, the test

$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} \qquad (7.3\text{-}1)$$

has no meaning because there are many more $\zeta$'s than just $\zeta_1$ and $\zeta_2$. To understand the
material in this section, the reader should recall that in the estimation of parameters by the
maximum likelihood method (MLM) the idea was to find the parameter $\theta$ in the likelihood
function $L(\theta)$ that was most likely to have yielded the observed result. Often this could
be found by differentiation but not always. In the problems discussed so far, there was no
need to maximize the likelihood function to find the most likely $\theta$ because $\theta$ had one of
two values, either $\zeta_1$ or $\zeta_2$. Suppose that the parameter of interest is the mean, that is,
$\theta = \mu$. Then in a problem such as $H_1: \mu = \mu_1$ versus $H_2: \mu \ne \mu_1$ the maximization of the
likelihood function associated with $H_2$ requires searching for the optimum value of $\mu$ in the
parameter space $(-\infty, \infty)$. In other words, while the hypothesis in this case is simple, the
alternative is not: It is said to be composite.

Fortunately not all composite hypothesis problems require such a search. We can still 
use the Neyman—Pearson rule and its desirable most-powerful property. We illustrate with 
an example.

Example 7.3-1
(testing the hypothesis $H_1: \mu = \mu_1$ versus the alternative $H_2: \mu < \mu_1$) We assume a Normal
population with mean $\mu$ and variance $\sigma^2$. At first glance it would seem that the likelihood
function associated with $H_2: \mu < \mu_1$ requires a search. However, we can reduce this
problem to a simple hypothesis versus a simple alternative by a slight modification of the
$H_2$ hypothesis. That is, we modify the problem to $H_1: \mu = \mu_1$ versus $H_2': \mu = \mu_2 < \mu_1$,
where $\mu_2$ is as yet arbitrary. Then

$$\Lambda = \exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(X_i - \mu_1)^2 - \sum_{i=1}^{n}(X_i - \mu_2)^2\right]\right) < k \qquad (7.3\text{-}2)$$

is the LRT for the critical region for $H_1$. Simplifying, taking logs, and aggregating all
constants, we obtain the test: if $\hat\mu < c_n$, reject $H_1$. To find the constant $c_n$ we proceed as
before; that is, we use the type I error criterion, that is, the significance level of the test.
Thus, say, with $\alpha = 0.01$ and the pdf $f_{\hat\mu}(x;\mu_1) = N(\mu_1, \sigma^2/n)$ we solve

$$0.01 = \int_{-\infty}^{c_n} (2\pi\sigma^2/n)^{-1/2} \exp\left(-0.5\left[\frac{x - \mu_1}{\sigma/\sqrt{n}}\right]^2\right) dx,$$

to obtain $c_n = \mu_1 - 2.32\sigma/\sqrt{n}$. Thus, we reject $H_1$ if $\hat\mu < \mu_1 - 2.32\sigma/\sqrt{n}$. Note that we
never had to specify an actual value for $\mu_2$.
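The simplification step in this example can be made explicit. The following is a sketch of the omitted algebra, obtained by expanding the squares in Equation 7.3-2 and writing $\hat\mu = n^{-1}\sum_{i=1}^{n} X_i$:

$$\ln\Lambda = -\frac{1}{2\sigma^2}\left[-2n\hat\mu(\mu_1 - \mu_2) + n(\mu_1^2 - \mu_2^2)\right] = \frac{n(\mu_1 - \mu_2)}{\sigma^2}\left(\hat\mu - \frac{\mu_1 + \mu_2}{2}\right).$$

Because $\mu_1 - \mu_2 > 0$, $\ln\Lambda$ increases with $\hat\mu$, so the event $\{\Lambda < k\}$ is the same as $\{\hat\mu < c_n\}$ for some constant $c_n$. The value of $\mu_2$ only affects how $k$ maps to $c_n$; the constant $c_n$ itself is then fixed by $\alpha$ under $H_1$ alone.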


Generalized Likelihood Ratio Test (GLRT) 


The GLRT is useful for solving composite hypotheses problems. First, recall that some
likelihood functions are functions of one parameter, some of two parameters, etc. For
example, the likelihood function associated with an n-sample of i.i.d. exponential RVs is
$L(\lambda) = \lambda^n \exp\left(-\lambda\sum_{i=1}^{n} X_i\right)\prod_{i=1}^{n} u(X_i)$† and is a function only of the parameter $\theta = \lambda$, while the
likelihood function associated with an n-sample of i.i.d. Normal RVs is

$$L(\mu,\sigma) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left[\frac{X_i - \mu}{\sigma}\right]^2\right)$$

and is a function of two parameters $\theta = (\mu,\sigma)$. The likelihood function associated with
a two-dimensional (multivariate) Normal would be a function of five parameters, that
is, $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$. We use the notation $L(\theta)$ to indicate a likelihood function of the
parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. Now consider the following problem: Let $\Theta$ denote the
global k-dimensional parameter space; for example, in the univariate Normal this would
be $\Theta = (-\infty < \mu < \infty, 0 < \sigma < \infty)$. Let $\Theta_1$ denote the parameter space (a subspace
of $\Theta$) associated with the hypothesis $H_1$. For example if $X: N(\mu_X, \sigma_X^2)$ and the hypothesis
is $H_1: 3 < \mu_X < 4$, then $\Theta_1 = (3 < \mu_X < 4, 0 < \sigma_X < \infty)$. Define the test statistic $\Lambda$ for
testing $H_1: \theta \in \Theta_1$ versus the alternative $H_2: \theta \notin \Theta_1$ as

$$\Lambda \triangleq \frac{L_{LM}(\theta^*)}{L_{GM}(\theta^\dagger)}, \qquad (7.3\text{-}3a)$$

where $L_{LM}(\theta^*) \triangleq \max_{\theta \in \Theta_1} L(\theta)$ and $L_{GM}(\theta^\dagger) \triangleq \max_{\theta \in \Theta} L(\theta)$. We may ask why $\Lambda$, as
given in Equation 7.3-3a, is a reasonable test statistic for accepting or rejecting $H_1$. First
recall that maximizing the numerator gives us the most likely parameter estimates, restricted
to $\Theta_1$, to account for the observations. Because our search is restricted to $\Theta_1$, the maximum
in this parameter subspace may not be a global maximum; hence we call it a local maximum.
Next, maximizing the denominator gives us the most likely unrestricted parameter estimates
that account for the observations; hence we call it a global maximum. The subscripts LM
and GM are there to remind the reader of the “local-max” and “global-max” operations,
respectively. We observe that $\Lambda$ is a random variable with its realization confined to [0,1].
(Question for the reader: Why is this so?) Now if the realizations of $\Lambda$ are close to one,
then we assume that $H_1$ is true; that is, the unknown parameters are in $\Theta_1$ but, in fact, are
also the most likely parameters in the whole space. On the other hand, if the realizations
of $\Lambda$ are small or close to zero we may assume that the most likely parameters are not
in $\Theta_1$. The threshold value $c$ denotes the point at which we go from accepting (rejecting)
the hypothesis to accepting (rejecting) the alternative $H_2$ and is usually determined by
enforcing the significance level $\alpha$. In summary then, the GLRT is described as

$$\text{reject } H_1 \text{ if } \Lambda < c, \qquad (7.3\text{-}3b)$$

where $\Lambda$ is given in Equation 7.3-3a. It has been shown that under certain conditions, the
GLRT is asymptotically optimal in the Neyman-Pearson sense. However there exist counterexamples
in the literature that prove that the GLRT is not always optimal [7-22]. In this
sense it must be regarded as being empirical.

†The function $u(x)$ is the unit step: $u(x) = 1$, $x \ge 0$, and zero elsewhere.

We illustrate the application of the GLRT with several examples involving continuous 
distributions. 


Example 7.3-2 
(testing $H_1: \mu = \mu_1$ versus $H_2: \mu \ne \mu_1$ when X is Normal and $\sigma^2$ known) We make $n$
observations on a Normal RV with known variance $\sigma^2$. The likelihood function is

$$\begin{aligned} L(\mu) &= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2\right) \\ &= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(X_i - \hat\mu)^2 + n(\hat\mu - \mu)^2\right]\right). \end{aligned} \qquad (7.3\text{-}4)$$

To go from line 1 to line 2 we generated some cross-terms in the argument of the exponent
that vanish in the summation. We leave the algebraic steps as an exercise to the reader.
Since $\sigma^2$ is specified, the space $\Theta$ is $(-\infty < \mu < \infty)$. Then $L_{GM}(\mu^\dagger)$ is obtained when
$\mu^\dagger = \hat\mu$, that is, $L_{GM}(\mu^\dagger) = L(\hat\mu)$, and since $\Theta_1$ contains only one point it follows that
$L_{LM}(\mu^*) = L(\mu_1)$. From Equation 7.3-3a we get

$$\Lambda = \frac{L(\mu_1)}{L(\hat\mu)} = \exp\left(-\frac{n}{2\sigma^2}(\hat\mu - \mu_1)^2\right) \qquad (7.3\text{-}5)$$

and the critical region is associated with outcomes of $\hat\mu$ that are far from $\mu_1$. When $\hat\mu$ is
near $\mu_1$, $\Lambda$ will take values near 1 and we would tend to accept $H_1$. Likewise when $\hat\mu$ is far
from $\mu_1$ it is unlikely that $H_1$ is true and we reject it. Somewhere in between is a constant
$c$ such that $0 < \Lambda < c$ describes the critical region. Taking natural logs, we find that the
critical region is defined by

$$\hat\mu > \mu_1 + \left(2\sigma^2 \ln(1/c)/n\right)^{1/2}$$

$$\hat\mu < \mu_1 - \left(2\sigma^2 \ln(1/c)/n\right)^{1/2}, \qquad (7.3\text{-}6)$$

where $c$ is determined by the significance level $\alpha$ of the test.


Example 7.3-3 
(numerical realization in Example 7.3-2) Here we obtain a numerical evaluation of Equation
7.3-6. Assume that $\mu_1 = 5$, $\sigma^2 = 4$, $n = 15$, and $\alpha = 0.05$. With $f_\Lambda(x;\mu_1)$ denoting the
pdf of $\Lambda$ we must compute $0.05 = \int_0^c f_\Lambda(x)\,dx$. But the event $\{\Lambda < c\}$ is identical to the
event $\{-\infty < \ln\Lambda < \ln c\}$, which in turn is identical to $\{-2\ln c < -2\ln\Lambda < \infty\}$. From
Equation 7.3-5, $-2\ln\Lambda = \left(\frac{\hat\mu - \mu_1}{\sigma/\sqrt{n}}\right)^2$, which is $\chi^2$ with one degree of freedom, that is, $\chi^2_1$
(the subscript indicates the degree of freedom). Denoting the $\chi^2_n$ pdf by $f_{\chi^2}(x;n)$ we write

$$0.05 = \int_0^c f_\Lambda(x)\,dx = \int_{-2\ln c}^{\infty} f_{\chi^2}(x;1)\,dx = 1 - F_{\chi^2}(-2\ln c; 1).$$

From the tables of the CDF of the $\chi^2_1$ RV we obtain $-2\ln c = 3.84$. Hence from Equation
7.3-6 we determine the critical region as $\hat\mu > 6.01$, $\hat\mu < 3.99$ or, as interval events mapped
by $\hat\mu$, $(-\infty, 3.99) \cup (6.01, \infty)$.
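The numbers in this example can be reproduced without a chi-square table. The sketch below relies on the fact that a $\chi^2$ RV with one degree of freedom is the square of a standard Normal RV, so its 0.95 percentile is the square of the Normal 0.975 percentile; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def glrt_two_sided_region(mu1, variance, n, alpha):
    """Return (-2 ln c, lower cut, upper cut) for the two-sided test of Equation 7.3-6."""
    # Chi-square with one DOF is a squared standard Normal, so its
    # (1 - alpha) percentile is the squared (1 - alpha/2) Normal percentile.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    minus_two_ln_c = z * z
    # From Equation 7.3-6: half width = sqrt(2 sigma^2 ln(1/c) / n).
    half_width = sqrt(variance * minus_two_ln_c / n)
    return minus_two_ln_c, mu1 - half_width, mu1 + half_width

chi2_cut, lower, upper = glrt_two_sided_region(mu1=5.0, variance=4.0, n=15, alpha=0.05)
print(f"-2 ln c = {chi2_cut:.2f}; reject H1 if mu_hat < {lower:.2f} or mu_hat > {upper:.2f}")
```

This also shows that, for this problem, the GLRT coincides with the familiar two-sided test on the standardized sample mean.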


Example 7.3-4 
(testing the telephone waiting time when the call is in a queue) A call to the Goldmad
Investment Bank (GIB) gets an automatic (robotic) operator that announces that during
business hours the average waiting time to speak to an investment consultant is less than 30
seconds (0.5 minutes). We wish to test this claim using the GLRT. We make $n$ calls to the
GIB during business hours and record the waiting times $X_i$, $i = 1,\ldots,n$, assumed to be
i.i.d. exponential random variables each with pdf $f_{X_i}(x;\mu) = (1/\mu)\exp(-x/\mu)\,u(x)$, where
$\mu = E(X_i)$, $i = 1,\ldots,n$. From basic probability we know that $\hat\mu = (1/n)\sum_{i=1}^{n} X_i$ is an unbiased,
consistent estimator for $\mu$. We test the hypothesis $H_1: \mu \le 0.5$ versus $H_2: \mu > 0.5$.

The likelihood function is $L(\mu) = (1/\mu)^n \exp\left(-(1/\mu)\sum_{i=1}^{n} X_i\right) = (1/\mu^n)\exp(-n\hat\mu/\mu)$.
Then $L_{GM}(\mu^\dagger)$ is obtained by differentiation with respect to $\mu$ to obtain

$$L_{GM}(\mu^\dagger) = L(\hat\mu) = \frac{1}{\hat\mu^n}\exp(-n).$$

Finding $L_{LM}(\mu^*)$ is a little more sophisticated. To illustrate what is going on we plot two
likelihood functions in Figure 7.3-1, one that peaks at the mean of 0.45 and another that
peaks at the mean of 0.55. The $\mu$ space $\Theta_1 = (0, 0.5]$ is based on our hypothesis that $\mu \le 0.5$
and includes the global maximum point 0.45. However, when the likelihood function is the
lower curve in Figure 7.3-1, which peaks at $\mu = 0.55$, the local maximum is not the same
as the global maximum since the point 0.55 is not in $\Theta_1 = (0, 0.5]$.
Hence

$$L_{LM}(\mu^*) = \begin{cases} \dfrac{1}{\hat\mu^n}\exp(-n), & \hat\mu \le 0.5 \\ 2^n \exp(-2n\hat\mu), & \hat\mu > 0.5. \end{cases}$$

Figure 7.3-1 The upper curve is the likelihood function when the true mean is at 0.45 and n = 10.
The lower curve is the likelihood function when the true mean is at 0.55 and n = 10. The subspace
$\Theta_1 = (0, 0.5]$ includes the point 0.45 (shown as the dotted line on the left of the solid line) but not the
point 0.55 (shown as the dotted line on the right of the solid line). (Plot title: Likelihood functions for different means.)

Finally, from Equation 7.3-3a, we get

$$\Lambda \triangleq \frac{L_{LM}(\mu^*)}{L_{GM}(\mu^\dagger)} = \begin{cases} 1, & \hat\mu \le 0.5 \\ (2\hat\mu)^n \exp\left(-n(2\hat\mu - 1)\right), & \hat\mu > 0.5. \end{cases} \qquad (7.3\text{-}7)$$


The critical region is the interval $(0, c')$; that is, all outcomes $\Lambda \in (0, c')$ would lead to the
rejection of $H_1$. The critical region is shown in Figure 7.3-2: On the $\Lambda$ axis it is below the
horizontal line at $c'$; on the $\hat\mu$ axis it is to the right of $\hat\mu = c$.

Because the likelihood ratio $\Lambda$ decreases monotonically with $\hat\mu$ in the region $\hat\mu > 0.5$
(Figure 7.3-2), we can use $\hat\mu$ as a test statistic. Assuming that $n$ is large enough for the
Normal approximation to apply to the behavior of $\hat\mu$, at least where the pdf has significant
value, that is, within a few sigmas around its mean, we have $\hat\mu: N(\mu, \mu^2/n)$ since
$\hat\mu$ is an unbiased estimator for $\mu$. In writing this result we recalled that the variance of
a single exponential RV is $\mu^2$ and therefore the variance of $\hat\mu$ is $\sigma^2_{\hat\mu} = \mu^2/n$. We create
the approximate standard Normal RV from $Z \triangleq (\hat\mu - \mu)\sqrt{n}/\mu$ and compute $c$ from the
significance constraint $\alpha$. Using the percentile notation $1 - \alpha = F_{SN}(z_{1-\alpha})$, we find that
$c = \mu + z_{1-\alpha}\,\mu/\sqrt{n}$, from which we see that the critical point $c$ increases linearly with $\mu$.
We reject the hypothesis when $\hat\mu > c$. For example with $\mu = 0.5$, $\alpha = 0.05$, and $n = 10$ we
find that $z_{0.95} = 1.64$ and $c = 0.76$. As $\alpha$ increases, the cut-off point decreases toward 0.5
(Figure 7.3-3).

Figure 7.3-2 Variation of the GLR test statistic with the sample mean estimator.

Figure 7.3-3 As $\alpha$ increases, the cut-off point decreases thereby increasing the width of the critical
region.
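The two quantities used in this example, the GLR statistic of Equation 7.3-7 and the Normal-approximation cut-off $c$, can be computed with the following Python sketch. The function names are illustrative, and the parameter `mu0` generalizes the hypothesized bound 0.5 as an assumption of this sketch.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def glr_exponential(mu_hat, n, mu0=0.5):
    """GLR statistic for H1: mu <= mu0 with exponential data (Equation 7.3-7 when mu0 = 0.5)."""
    if mu_hat <= mu0:
        return 1.0  # the global maximum already lies inside Theta_1
    r = mu_hat / mu0
    # (mu_hat/mu0)^n * exp(-n (mu_hat/mu0 - 1)), computed in logs for stability
    return exp(n * log(r) - n * (r - 1.0))

def normal_approx_cutoff(n, alpha, mu0=0.5):
    """Cut-off c = mu0 + z_{1-alpha} mu0 / sqrt(n), using the large-n Normal approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return mu0 * (1.0 + z / sqrt(n))

c = normal_approx_cutoff(n=10, alpha=0.05)
c_prime = glr_exponential(c, n=10)  # the corresponding threshold on the Lambda axis
print(f"c = {c:.2f}, c' = {c_prime:.3f}")
```

Since $n = 10$ is small, the Normal approximation is rough here; an exact cut-off would instead use the distribution of a sum of exponential RVs.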


Example 7.3-5 
(evaluation of cancer treatment by the drug Herceptin) Newer treatments for cancer involve 
disabling the proteins that fuel cancer. For example, some breast cancers contain a protein 
called HER2. In such cases, the drug Herceptin is partially effective in treating the cancer 
in that it reduces the cancer recurrence by 50 percent.† Tumors that do not exhibit HER2
have better prognoses than those that do. Since Herceptin has significant toxic side effects,
it is important that the test for the HER2 protein is accurate but this is not always the
case. Let $H_1$: tumor has a high level of HER2 and therefore will respond to Herceptin, and
let $H_2$: tumor has low levels (or none at all) of HER2 and therefore the patient should not
be given Herceptin. It is estimated that in current testing for HER2:

$$P[\text{decide } H_1 \text{ is true} \mid H_2 \text{ is true}] = 0.2$$

$$P[\text{decide } H_2 \text{ is true} \mid H_1 \text{ is true}] = 0.1.$$

Hence, the tests have a significance level of 0.1 and a power of 0.8.

†“Cancer Fight: Unclear Tests for New Drug,” New York Times, April 20, 2010.


How Do We Test for the Equality of Means of Two Populations? 


Assume that there is a drug being tested for androgen-independent prostate cancer.† The
drug is administered to a group of men with advanced prostate cancer. Does the drug extend
the lives of the participants compared with those of men taking the traditional therapy?
A printing company is evaluating two types of paper for use in its presses. Is one type of
paper less likely to jam the presses than the other? The Department of Transportation is
considering buying concrete from two different sources. Is one more resistant to potholes
than the other is? Some of these problems fall within the following framework. We have two
populations, assumed Normal, and we have m samples from population P1 and n samples
from population P2. Is the mean of population P1 equal to the mean of population P2? In
general, this is a difficult problem, essentially beyond the scope of the discussion treated 
in this chapter. More discussion on this problem is given in [7-1]. However, when one can 
assume that the variance of the populations is the same, the problem is treatable analytically 
in a straightforward way. In preparation for discussing this problem, we review some related 
material in Example 7.3-6. 


Example 7.3-6 
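(preliminary results for Example 7.3-7) We have samples from two Normal populations
$S_1 = \{X_{1i}, i = 1,\ldots,m\}$ and $S_2 = \{X_{2i}, i = 1,\ldots,n\}$. The elements of $S_1$ are $m$ i.i.d.
observations on $X_1$ with $X_1: N(\mu_1, \sigma_1^2)$. Likewise, the elements of $S_2$ are $n$ i.i.d. observations
on $X_2$ with $X_2: N(\mu_2, \sigma_2^2)$. Further, assume that $E[(X_{1i} - \mu_1)(X_{2j} - \mu_2)] = 0$, all $i, j$.

(i) Assuming $\mu_1 = \mu_2 = \mu$, show that $E[\hat\mu_1 - \hat\mu_2] = 0$.
Solution to (i) $E[\hat\mu_1 - \hat\mu_2] = E[\hat\mu_1] - E[\hat\mu_2] = \mu - \mu = 0$.

(ii) Assume that $\mu_1 = \mu_2 = \mu$ and $\sigma_1 = \sigma_2 = \sigma$. Show that $\mathrm{Var}(\hat\mu_1 - \hat\mu_2) = (m^{-1} + n^{-1})\sigma^2$.
Solution to (ii) Since $E[\hat\mu_1] = E[\hat\mu_2] = \mu$, $\mathrm{Var}(\hat\mu_1 - \hat\mu_2) = E[(\hat\mu_1 - \hat\mu_2)^2] = E[\hat\mu_1^2] + E[\hat\mu_2^2] - 2E[\hat\mu_1\hat\mu_2]$. Substitute

$$E[\hat\mu_1^2] = m^{-2}\left(\sum_{i=1}^{m} E(X_{1i}^2) + \sum_{i=1}^{m}\sum_{j \ne i} E(X_{1i}X_{1j})\right),$$

$$E[\hat\mu_2^2] = n^{-2}\left(\sum_{i=1}^{n} E(X_{2i}^2) + \sum_{i=1}^{n}\sum_{j \ne i} E(X_{2i}X_{2j})\right),$$

$$E(X_{1i}^2) = \mu^2 + \sigma^2 = E(X_{2i}^2), \quad \text{and} \quad E(\hat\mu_1\hat\mu_2) = \mu^2$$

into the expression for the variance and obtain the required result.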
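The substitution can be completed as follows. This sketch additionally uses $E(X_{1i}X_{1j}) = \mu^2$ for $j \ne i$ (and likewise for the second sample), which follows from independence within each sample:

$$E[\hat\mu_1^2] = m^{-2}\left[m(\mu^2 + \sigma^2) + m(m-1)\mu^2\right] = \mu^2 + \frac{\sigma^2}{m}, \qquad E[\hat\mu_2^2] = \mu^2 + \frac{\sigma^2}{n},$$

$$\mathrm{Var}(\hat\mu_1 - \hat\mu_2) = \left(\mu^2 + \frac{\sigma^2}{m}\right) + \left(\mu^2 + \frac{\sigma^2}{n}\right) - 2\mu^2 = (m^{-1} + n^{-1})\sigma^2.$$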
†Androgen-independent means that the cancer is not fueled by testosterone. It is difficult to treat. The
authors’ colleague, Prof. Nick Galatsanos, an important contributor to the science of image processing, died
from this illness at the age of 52.

(iii) Show that if $V$ and $W$ are Chi-square with degrees of freedom (DOF) $m$ and $n$,
respectively, then $U \triangleq V + W$ is Chi-square with DOF $m + n$.
Solution to (iii) If $V: \chi^2_m$ and $W: \chi^2_n$ then $V = \sum_{i=1}^{m} Y_i^2$ and $W = \sum_{i=1}^{n} Z_i^2$, where we
can assume that $Y_i$, $i = 1,\ldots,m$, are i.i.d. $N(0,1)$ and $Z_i$, $i = 1,\ldots,n$, are i.i.d. $N(0,1)$.
The MGF of $V$ is $M_V(t) = E[\exp(tV)]$ and is computed as

$$\begin{aligned} M_V(t) &= (2\pi)^{-m/2} \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \exp\left(t\sum_{i=1}^{m} y_i^2\right) \exp\left(-\frac{1}{2}\sum_{i=1}^{m} y_i^2\right) \prod_{i=1}^{m} dy_i \\ &= \prod_{i=1}^{m} (2\pi)^{-1/2} \int_{-\infty}^{\infty} \exp\left(-0.5(1 - 2t)y_i^2\right) dy_i \\ &= (1 - 2t)^{-m/2} \quad \text{for } t < 1/2. \end{aligned}$$

Line 1 is by definition; line 2 is by the i.i.d. assumption on the $Y_i$’s; and line 3 results from the
total area under the Normal curve being unity. Because $U = V + W$ and $V$ and $W$ are jointly
independent, it follows from the discussion in Section 4.4 that $M_U(t) = M_V(t)M_W(t)$. Since
$M_V(t) = (1 - 2t)^{-m/2}$ and $M_W(t) = (1 - 2t)^{-n/2}$ it follows that $M_U(t) = (1 - 2t)^{-(n+m)/2}$,
which implies that $U: \chi^2_{m+n}$.


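(iv) Given the likelihood function $L = (2\pi\sigma^2)^{-m/2} \exp\left[-0.5\sum_{i=1}^{m}\left((X_i - \mu)^2/\sigma^2\right)\right]$ show
that $L_{GM} = L(\hat\mu^\dagger, \hat\sigma^{2\dagger})$, where, in this case, $\hat\mu^\dagger = \hat\mu$ and $\hat\sigma^{2\dagger} = \hat\sigma^2$.

Solution to (iv) We obtain $\hat\mu^\dagger$ by differentiating $\ln L$ with respect to $\mu$ and obtain
$\hat\mu^\dagger = \hat\mu = m^{-1}\sum_{i=1}^{m} X_i$. Likewise, we obtain $\hat\sigma^{2\dagger}$ by differentiating with respect to $\sigma^2$ and obtain
$\hat\sigma^{2\dagger} = \hat\sigma^2 = m^{-1}\sum_{i=1}^{m}(X_i - \hat\mu)^2$. Substituting into the expression for $L$ we compute $L_{GM}$ as

$$L_{GM} = \left(\frac{m}{2\pi\sum_{i=1}^{m}(X_i - \hat\mu)^2}\right)^{m/2} e^{-m/2}. \qquad (7.3\text{-}8)$$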

Example 7.3-7 
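(testing $H_1: \mu_1 = \mu_2$ versus $H_2: \mu_1 \ne \mu_2$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$ not known) As in Example
7.3-6 we have samples from two Normal populations $S_1 = \{X_{1i}, i = 1,\ldots,m\}$ and $S_2 =
\{X_{2i}, i = 1,\ldots,n\}$. The elements of $S_1$ are $m$ i.i.d. observations on $X_1$ with $X_1: N(\mu_1, \sigma_1^2)$.
Likewise, the elements of $S_2$ are $n$ i.i.d. observations on $X_2$ with $X_2: N(\mu_2, \sigma_2^2)$. Further,
assume that $E[(X_{1i} - \mu_1)(X_{2j} - \mu_2)] = 0$, for all $i, j$. We shall test $H_1: \mu_1 = \mu_2$ versus
$H_2: \mu_1 \ne \mu_2$. The parameter space† for $H_1$ is $\Theta_1 = (\mu, \sigma^2)$ while the global parameter
space is $\Theta = (\mu_1, \mu_2, \sigma^2)$. The likelihood function is

$$L = \left(\frac{1}{2\pi\sigma^2}\right)^{(m+n)/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_{1i} - \mu_1}{\sigma}\right)^2\right) \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{X_{2i} - \mu_2}{\sigma}\right)^2\right) \qquad (7.3\text{-}9)$$

and from the results of (iv) in Example 7.3-6 we obtain

$$\hat\mu_1^\dagger = m^{-1}\sum_{i=1}^{m} X_{1i} = \hat\mu_1$$

$$\hat\mu_2^\dagger = n^{-1}\sum_{i=1}^{n} X_{2i} = \hat\mu_2$$

$$\hat\sigma^{2\dagger} = \frac{1}{m+n}\left(\sum_{i=1}^{m}(X_{1i} - \hat\mu_1)^2 + \sum_{i=1}^{n}(X_{2i} - \hat\mu_2)^2\right) = \hat\sigma^2.$$

We insert these results in $L$ for $\mu_1$, $\mu_2$, and $\sigma^2$, respectively, to obtain

$$L_{GM} = \left(\frac{m+n}{2\pi\left(\sum_{i=1}^{m}(X_{1i} - \hat\mu_1)^2 + \sum_{i=1}^{n}(X_{2i} - \hat\mu_2)^2\right)}\right)^{(m+n)/2} \exp\left(-\frac{m+n}{2}\right). \qquad (7.3\text{-}10)$$

†To avoid excessive notation we denote a parameter space such as $\Theta = \{-\infty < \mu_1 < \infty, -\infty < \mu_2 <
\infty, \sigma_1 > 0, \sigma_2 > 0\}$ by $\Theta = \{\mu_1, \mu_2, \sigma_1, \sigma_2\}$ etc. for other cases, when there is no danger of confusion. Then
the expression $L(\Theta)$ can be interpreted as the likelihood function of parameters in the space $\Theta$.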


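Returning now to the likelihood function in Equation 7.3-9, we wish to maximize this in
the parameter subspace $\Theta_1$. Since in $H_1$ $\mu_1 = \mu_2 = \mu$, we rewrite $L$ as

$$L(\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{(m+n)/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_{1i}-\mu}{\sigma}\right)^2\right) \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{X_{2i}-\mu}{\sigma}\right)^2\right). \tag{7.3-11}$$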
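Straightforward differentiation with respect to $\mu$ and $\sigma^2$ yields $\mu^*$ and $\sigma^{2*}$ as

$$\mu^* = \frac{1}{m+n}\left(\sum_{i=1}^{m} X_{1i} + \sum_{i=1}^{n} X_{2i}\right) = \frac{m\hat{\mu}_1 + n\hat{\mu}_2}{m+n}$$

and

$$\sigma^{2*} = \frac{1}{m+n}\left(\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2 + \frac{mn}{m+n}(\hat{\mu}_1-\hat{\mu}_2)^2\right).$$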


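When $\mu^*$ and $\sigma^{2*}$ are substituted for $\mu$ and $\sigma^2$ in $L(\Theta_1)$ of Equation 7.3-11, we obtain $L_{LM}$
as

$$L_{LM} = \left(\frac{m+n}{2\pi\left(\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2 + \frac{mn}{m+n}(\hat{\mu}_1-\hat{\mu}_2)^2\right)}\right)^{(m+n)/2} e^{-(m+n)/2}.$$

The likelihood ratio $\Lambda \triangleq L_{LM}/L_{GM}$ is computed as

$$\Lambda = \left(1 + \frac{\frac{mn}{m+n}(\hat{\mu}_1-\hat{\mu}_2)^2}{\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2}\right)^{-(m+n)/2}. \tag{7.3-12}$$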


From (ii) in Example 7.3-6, $\hat{\mu}_1 - \hat{\mu}_2$ is distributed as $N(0, \sigma^2(m+n)/mn)$, so that

$$Z \triangleq \frac{\hat{\mu}_1-\hat{\mu}_2}{\sigma\sqrt{(m+n)/mn}}$$

is distributed as $N(0,1)$. Likewise,

$$W_{m+n-2} \triangleq \sum_{i=1}^{m}\left(\frac{X_{1i}-\hat{\mu}_1}{\sigma}\right)^2 + \sum_{i=1}^{n}\left(\frac{X_{2i}-\hat{\mu}_2}{\sigma}\right)^2$$

is Chi-square with DOF $m+n-2$ by (iii) of Example 7.3-6. Finally, recall that

$$T_{m+n-2} \triangleq \frac{Z}{\sqrt{W_{m+n-2}/(m+n-2)}}$$

is the t-distributed RV with DOF $m+n-2$ so that

$$\Lambda = \left(1 + \frac{T^2_{m+n-2}}{m+n-2}\right)^{-(m+n)/2}. \tag{7.3-13}$$

Figure 7.3-4 (plot of $\Lambda$ versus $T^2_{m+n-2}$) Critical region shown in heavy lines. It is easier to test $H_1$ versus $H_2$ using a test on
$T^2_{m+n-2}$ than on $\Lambda$.

Figure 7.3-5 (plot of $\Lambda$ versus $T$) Instead of doing the test on the GLR statistic, it is more convenient to do the test on the
$T$ statistic. See Equation 7.3-13. The critical region along the $T$-axis is shown in heavy lines. For the
reader's interest, for this graph $m = n = 10$. The hypothesis is rejected if $|T| > t_c$, where $t_c$ depends
on the type I error $\alpha$. In a two-sided test at significance $\alpha$ we assign $\alpha/2$ error mass to each half of the
critical region, that is, $P[T > t_c] = \alpha/2$ and $P[T < -t_c] = \alpha/2$.


Since $\Lambda$ is a monotonically decreasing function of $T^2_{m+n-2}$, the test can be made on $T^2_{m+n-2}$
rather than on $\Lambda$. Then the critical region for $H_1$ of the form $0 < \Lambda < \lambda_c$ translates, when
the test is done on $T^2_{m+n-2}$, as the critical region $(t_c^2, \infty)$ (Figure 7.3-4) or, equivalently, as
the union of the events $(t_c, \infty)$ and $(-\infty, -t_c)$ (Figure 7.3-5). More information on this type
of test, the so-called t-test, can be found in [7-10] to [7-15] and/or on the Internet by entering
t-test in Google or another search engine.

Under the constraint of a type I error $\alpha$ we reject the hypothesis if the event
$\{T^2 > t^2_{1-\alpha/2}\}$ occurs, where $t_{1-\alpha/2}$ is obtained from the t-distribution tables with $m+n-2$
degrees of freedom using $F_T(t_{1-\alpha/2}) = 1 - \alpha/2$.


Example 7.3-8
(numerical example of testing $H_1: \mu_1 = \mu_2$ versus $H_2: \mu_1 \neq \mu_2$) We call on a Gaussian
random number generator (these are available on the Internet) and generate 15 samples from
a $N(0,2)$ population (P1) and 15 samples from a $N(2,2)$ population (P2). We reproduce
the numbers here:

From population P1: $S_1$ = {2.21, 0.83, 0.393, 0.975, 0.195, −0.069, −1.91, 1.44, −3.98, 0.98,
2.84, −1.56, −0.4, −1.08, 0.116}; $\hat{\mu}_1' = -0.258$; $m = 15$; $\sum_{i=1}^{15}(X_{1i}' - \hat{\mu}_1')^2 = 40.48$.
From population P2: $S_2$ = {−1.28, −0.258, −0.947, 5.85, 1.56, 1.48, 1.95, 3.22, 1.41, 1.84,
2.69, 3.94, 2.04, 2.08, 1.44}; $\hat{\mu}_2' = 1.801$; $n = 15$; $\sum_{i=1}^{15}(X_{2i}' - \hat{\mu}_2')^2 = 45.66$. We insert the
data in

$$T^2 \triangleq (m+n-2)\,\frac{mn(m+n)^{-1}(\hat{\mu}_1-\hat{\mu}_2)^2}{\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2} \tag{7.3-14}$$

and obtain the realization for $T^2$ as 10.34. Finally with $\alpha = P(\text{reject } H_1 | H_1 \text{ true}) = 0.01$,
we find that $F_T(t_{1-\alpha/2}) = 1 - 0.005 = 0.995$ with DOF of $15 + 15 - 2 = 28$. From the tables
of the t-distribution we find that $t_{1-\alpha/2} = 2.763$ or $t^2_{1-\alpha/2} = 7.63$. Since $T^2 > t^2_{1-\alpha/2}$ we
reject the hypothesis that the means are the same.
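The arithmetic of Example 7.3-8 can be checked with a short sketch. It evaluates Equation 7.3-14 from the summary values quoted in the example (sample means and sums of squared deviations) rather than from the raw samples; the function name `t_squared` is a hypothetical helper, and the critical value 2.763 is simply the table value quoted in the text.

```python
# Two-sample t-test statistic of Equation 7.3-14, computed from summary statistics.
def t_squared(m, n, mean1, mean2, ss1, ss2):
    """Return T^2 = (m+n-2) * [mn/(m+n)] * (mean1-mean2)^2 / (ss1+ss2).

    ss1 and ss2 are the sums of squared deviations about each sample mean.
    """
    between = m * n / (m + n) * (mean1 - mean2) ** 2
    return (m + n - 2) * between / (ss1 + ss2)

# Summary values quoted in Example 7.3-8.
t2 = t_squared(m=15, n=15, mean1=-0.258, mean2=1.801, ss1=40.48, ss2=45.66)

# Table value quoted in the text for 28 DOF and alpha = 0.01 (two-sided).
t2_critical = 2.763 ** 2

reject = t2 > t2_critical
print(round(t2, 2), round(t2_critical, 2), reject)
```

Working on $T^2$ rather than on $\Lambda$ keeps the decision rule a single comparison against a tabulated percentile.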


Testing for the Equality of Variances for Normal Populations: 
the F-test 


Another problem we encounter is whether two Normal populations have the same variance.
The model is the following: We have two Normal populations P1, $N(\mu_1,\sigma_1^2)$, and P2,
$N(\mu_2,\sigma_2^2)$, and collect $m$ samples (i.e., we make $m$ i.i.d. observations) $S_1 = \{X_{1i}, i =
1,\ldots,m\}$ from P1 and $n$ samples $S_2 = \{X_{2i}, i = 1,\ldots,n\}$ from P2. Based on these
samples we wish to test the hypothesis $H_1: \sigma_1^2 = \sigma_2^2 \triangleq \sigma^2$ versus the alternative
$H_2: \sigma_1^2 \neq \sigma_2^2$. The parameter space for testing $H_1$ is $\Theta_1 = \{\mu_1,\mu_2,\sigma^2\}$ while the
parameter space for $H_2$ is the global parameter space $\Theta = \{\mu_1,\mu_2,\sigma_1^2,\sigma_2^2\}$. The likelihood
function is

$$L(\Theta) = (2\pi\sigma_1^2)^{-m/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_{1i}-\mu_1}{\sigma_1}\right)^2\right) \times (2\pi\sigma_2^2)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{X_{2i}-\mu_2}{\sigma_2}\right)^2\right),$$
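which in $\Theta_1 = \{\mu_1,\mu_2,\sigma^2\}$ assumes the form

$$L(\Theta_1) = (2\pi\sigma^2)^{-(m+n)/2}\exp\left(-\frac{\sum_{i=1}^{m}(X_{1i}-\mu_1)^2 + \sum_{i=1}^{n}(X_{2i}-\mu_2)^2}{2\sigma^2}\right).$$

The parameters that maximize $L(\Theta)$ in $\Theta_1$ are, as usual, obtained by differentiating $\ln L(\Theta_1)$
with respect to $\mu_1, \mu_2, \sigma^2$ and setting the derivatives to zero to obtain

$$\mu_1^* = m^{-1}\sum_{i=1}^{m} X_{1i} = \hat{\mu}_1; \qquad \mu_2^* = n^{-1}\sum_{i=1}^{n} X_{2i} = \hat{\mu}_2;$$

$$\sigma^{2*} = (m+n)^{-1}\left(\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2\right).$$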


When these results are inserted into $L(\Theta_1)$, we obtain

$$L_{LM} = \left(\frac{2\pi}{m+n}\left[\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2\right]\right)^{-(m+n)/2}\exp(-(m+n)/2).$$
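To maximize $L(\Theta)$ in $\Theta = \{\mu_1,\mu_2,\sigma_1^2,\sigma_2^2\}$ we differentiate $\log L(\Theta)$ with respect to $\mu_1, \mu_2, \sigma_1^2,
\sigma_2^2$ and set the derivatives to zero to obtain

$$\mu_1^{\dagger} = m^{-1}\sum_{i=1}^{m} X_{1i} = \hat{\mu}_1; \qquad \mu_2^{\dagger} = n^{-1}\sum_{i=1}^{n} X_{2i} = \hat{\mu}_2;$$

$$\sigma_1^{2\dagger} = m^{-1}\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 = \hat{\sigma}^2_{1,ML}; \qquad \sigma_2^{2\dagger} = n^{-1}\sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2 = \hat{\sigma}^2_{2,ML}.$$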


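We note that the maximum likelihood variance estimators $\hat{\sigma}^2_{1,ML}, \hat{\sigma}^2_{2,ML}$ of the variances
$\sigma_1^2, \sigma_2^2$ are not unbiased. When we substitute these results into $L(\Theta)$, we obtain $L_{GM}$ as

$$L_{GM} = (2\pi)^{-(m+n)/2}\left(\frac{1}{m}\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2\right)^{-m/2}\left(\frac{1}{n}\sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2\right)^{-n/2} \times \exp(-(m+n)/2).$$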


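Finally, with $\Lambda = L_{LM}/L_{GM}$ we obtain

$$\Lambda = \frac{\left(\frac{1}{m}\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2\right)^{m/2}\left(\frac{1}{n}\sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2\right)^{n/2}}{\left(\frac{1}{m+n}\left[\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2 + \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2\right]\right)^{(m+n)/2}}. \tag{7.3-15}$$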


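This formidable-looking expression can be dramatically simplified by recognizing that

$$(m-1)\hat{\sigma}_1^2 = \sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2, \qquad (n-1)\hat{\sigma}_2^2 = \sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2,$$

where $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ are the unbiased sample variances, so that, after a little algebra, we obtain

$$\Lambda = A(m,n)\,\frac{\left([(m-1)/(n-1)] \times \hat{\sigma}_1^2/\hat{\sigma}_2^2\right)^{m/2}}{\left(1 + [(m-1)/(n-1)] \times \hat{\sigma}_1^2/\hat{\sigma}_2^2\right)^{(m+n)/2}},$$

where $A(m,n) \triangleq (m+n)^{(m+n)/2}m^{-m/2}n^{-n/2}$. It is natural to call $V_R \triangleq \hat{\sigma}_1^2/\hat{\sigma}_2^2$ the (estimator)
variance ratio, where

$$V_R = \frac{(m-1)^{-1}\sum_{i=1}^{m}(X_{1i}-\hat{\mu}_1)^2}{(n-1)^{-1}\sum_{i=1}^{n}(X_{2i}-\hat{\mu}_2)^2}. \tag{7.3-16}$$

Then, in terms of $V_R$,

$$\Lambda = A(m,n)\,\frac{\left([(m-1)/(n-1)] \times V_R\right)^{m/2}}{\left(1 + [(m-1)/(n-1)] \times V_R\right)^{(m+n)/2}}.$$

Figure 7.3-6 (plot of $\Lambda$ versus $V_R$) The test statistic $\Lambda$ versus the variance ratio $V_R$ for $m = n = 10$.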
When $H_1$ is true $V_R = F_{m-1,n-1}$, where $F_{m-1,n-1}$ is the random variable with the
F-distribution with $m-1$ and $n-1$ degrees of freedom, respectively. The variation of
$\Lambda$ with $V_R$ is shown in Figure 7.3-6 for $m = n = 10$. It should be clear from the figure that
rejection of the hypothesis, that is, the event $\{0 < \Lambda(V_R) < c\}$, is equivalent to the two-tailed
event $\{0 < V_R < t_l\} \cup \{t_u < V_R < \infty\}$. Hence, given a significance level $\alpha$, we can solve for
$t_l$ and $t_u$ from $P[0 < V_R < t_l] + P[t_u < V_R < \infty] = \alpha$, using $\Lambda(t_l) = \Lambda(t_u)$. But for simplicity
and without much loss of accuracy, we choose $P[0 < V_R < t_l'] = P[t_u' < V_R < \infty] = \alpha/2$,
the numbers $t_l'$ and $t_u'$ being easier to determine than the numbers $t_l$ and $t_u$. See Figure
7.3-7. Indeed with $F_F(x_\beta; m-1; n-1)$ denoting the CDF of the RV $F_{m-1,n-1}$ evaluated
at the $\beta$ percentile point, that is, $F_F(x_\beta; m-1; n-1) = \beta$, we observe that $t_l' = x_{\alpha/2}$ and
$t_u' = x_{1-\alpha/2}$.

The hypothesis $H_1$ is rejected when the test yields the event $\{0 < \Lambda < c\}$ or, equivalently,
when $\{0 < V_R < x_{\alpha/2}\}$ or $\{x_{1-\alpha/2} < V_R\}$.
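The dependence of $\Lambda$ on the variance ratio can be explored numerically. The sketch below evaluates the expression for $\Lambda$ in terms of $V_R$ given above; the function name `glr_from_variance_ratio` is a hypothetical helper, and $m = n = 10$ is chosen only because the figure uses those values.

```python
def glr_from_variance_ratio(v_r, m, n):
    """Evaluate Lambda(V_R) = A(m,n) * s^(m/2) / (1 + s)^((m+n)/2),
    with s = [(m-1)/(n-1)] * V_R and A(m,n) = (m+n)^((m+n)/2) m^(-m/2) n^(-n/2).
    """
    a = (m + n) ** ((m + n) / 2) * m ** (-m / 2) * n ** (-n / 2)
    s = (m - 1) / (n - 1) * v_r
    return a * s ** (m / 2) / (1 + s) ** ((m + n) / 2)

# With m = n the statistic peaks at V_R = 1 and falls off on both sides,
# which is why small Lambda corresponds to a two-tailed event on V_R.
lam_at_one = glr_from_variance_ratio(1.0, 10, 10)
lam_low = glr_from_variance_ratio(0.5, 10, 10)
lam_high = glr_from_variance_ratio(2.0, 10, 10)
print(lam_at_one, lam_low, lam_high)
```

For equal sample sizes the function takes the same value at $V_R$ and $1/V_R$, so the exact thresholds satisfy $t_l = 1/t_u$ in that special case.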
Figure 7.3-7 The event $\{0 < \Lambda < c\}$ is equivalent to the event $\{0 < V_R < t_l\} \cup \{t_u < V_R < \infty\}$.
The numbers $t_l$ and $t_u$ are replaced by numbers $t_l'$ and $t_u'$ that make the error in both tails $\alpha/2$.


Example 7.3-9
(numerical example of testing $H_1: \sigma_1^2 = \sigma_2^2 = \sigma^2$ versus $H_2: \sigma_1^2 \neq \sigma_2^2$) We test the hypothesis
that the variances of two populations are the same.

We call the RANDOM.ORG routine available on the Internet and create two sets of
Gaussian pseudo-random numbers as shown in the two rows below:


N(0,1): 0.436, —1.06, —1.11, 0.46, 0.491, —1.05, 0.502, 0.598, 1.61, 
—0.981, 0.021, 0.253, 1.24, 0.059, 2.12; 
N(0,4): 0.634, 0.0818, 1.32, 2.96, 3.11, 3.13, 2.62, —1.96, 0.85, 
=651, =<339, 425, 108, 342 “72, 


From the top two rows, that is, the $N(0,1)$ data, we compute $\hat{\mu}_1 = 0.074$, $\hat{\sigma}_1 =
1.01$, $\hat{\sigma}_1^2 = 1.02$; from the bottom two rows, that is, the $N(0,4)$ data, we compute $\hat{\mu}_2 =
0.54$, $\hat{\sigma}_2 = 3.04$, $\hat{\sigma}_2^2 = 9.25$. Placing the larger sample variance in the numerator (with $m = n$ the
equal-area test is unaffected by which sample is placed in the numerator), we compute the variance ratio as

$$V_R' = \frac{(15-1)^{-1}\sum_{i=1}^{15}(X_{2i}-0.54)^2}{(15-1)^{-1}\sum_{i=1}^{15}(X_{1i}-0.074)^2} = \frac{9.25}{1.02} \approx 9.06.$$

At the level of $\alpha = 0.05$, and using the "equal-area" system for distributing the error probability,
we seek the percentile points $x_{0.025}$ and $x_{0.975}$ such that $F_F(x_{0.025}; 14; 14) =
0.025$ and $F_F(x_{0.975}; 14; 14) = 0.975$. As an alternative to using F-distribution tables, we
call the Stat Trek Online Statistical Table for the F-distribution calculator, and enter the
degrees of freedom (14 in both cases) and the CDF value of 0.025 to obtain $x_{0.025} = 0.34$. We
repeat with the CDF value of 0.975 and obtain $x_{0.975} = 2.98$. Thus, the acceptance region
is the interval (event) $(0.34, 2.98)$ and the critical region is the event $\{(0, 0.34) \cup (2.98, \infty)\}$.
The test statistic yields 9.06, an event deep in the rejection region and therefore associated
with the rejection of the hypothesis that the two variances are the same. Therefore we
conclude, quite rightly, that the data come from different populations.


More on the so-called F-test can be found in [7-5] to [7-9] and online by a Google search on 
the entry “F-test.” 
Testing Whether the Variance of a Normal Population Has a
Predetermined Value


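In this situation we consider a Normal population and test whether the variance of this
population has a predetermined value. We proceed as follows: We take $m$ samples from
a Normal population $X$, that is, make $m$ i.i.d. observations on $X$ that we label $\{X_i, i =
1,\ldots,m\}$. Under $H_1$ we assume that the variance of the population is the predetermined
$\sigma_0^2$. The alternative hypothesis $H_2$ is that the variance of the population is not equal to $\sigma_0^2$
or, more precisely, that there is not enough evidence to support the validity of $H_1$. As usual
we begin with the likelihood function and maximize it, respectively, in $\Theta_1 = \{\mu,\sigma_0^2\}$ and
$\Theta = \{\mu,\sigma^2\}$. Thus,

$$L(\Theta_1) = (2\pi\sigma_0^2)^{-m/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_i-\mu}{\sigma_0}\right)^2\right),$$

which is maximized when $\mu^* = \hat{\mu} = m^{-1}\sum_{i=1}^{m} X_i$. Hence

$$L_{LM} = (2\pi\sigma_0^2)^{-m/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_i-\hat{\mu}}{\sigma_0}\right)^2\right).$$

Likewise,

$$L(\Theta) = (2\pi\sigma^2)^{-m/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_i-\mu}{\sigma}\right)^2\right),$$

which is maximized when $\mu^{\dagger} = \hat{\mu}$ and $\sigma^{2\dagger} = \hat{\sigma}^2 = m^{-1}\sum_{i=1}^{m}(X_i-\hat{\mu})^2$. Hence

$$L_{GM} = (2\pi\hat{\sigma}^2)^{-m/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{m}\left(\frac{X_i-\hat{\mu}}{\hat{\sigma}}\right)^2\right) = (2\pi\hat{\sigma}^2)^{-m/2}e^{-m/2}.$$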

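The generalized likelihood ratio is then

$$\Lambda = L_{LM}/L_{GM} = \left(m^{-1}\sum_{i=1}^{m}\left(\frac{X_i-\hat{\mu}}{\sigma_0}\right)^2\right)^{m/2}\exp\left(-0.5\left[\sum_{i=1}^{m}\left(\frac{X_i-\hat{\mu}}{\sigma_0}\right)^2 - m\right]\right).$$

We note that $W \triangleq \sum_{i=1}^{m}((X_i-\hat{\mu})/\sigma_0)^2$ is $\chi^2_{m-1}$. Then

$$\Lambda = (m^{-1}W)^{m/2}\exp(-0.5(W-m)),$$

which is graphed as $\Lambda$ versus $W$ in Figure 7.3-8 for a DOF = 9.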

From Figure 7.3-8 we deduce that the critical event, that is, the event $\{0 < \Lambda < c\}$,
is equivalent to $\{0 < W < t_l\} \cup \{t_u < W < \infty\}$, where $\Lambda(t_l) = \Lambda(t_u)$ and $t_l < t_u$.
For simplicity, however, we might choose the "equal area" rule by which we seek numbers
$t_l' < t_u'$ such that $t_l' = x_{\alpha/2}$ and $t_u' = x_{1-\alpha/2}$, where $x_{\alpha/2}$ and $x_{1-\alpha/2}$ are the $\alpha/2$ and $1-(\alpha/2)$
percentiles, that is, $F_{\chi^2}(x_{\alpha/2}; m-1) = \alpha/2$ and $F_{\chi^2}(x_{1-\alpha/2}; m-1) = 1-(\alpha/2)$ and, as
usual, $\alpha = P[\text{reject } H_1 | H_1 \text{ true}]$.
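To see how $\Lambda$ behaves as a function of $W$, the following sketch evaluates $\Lambda = (W/m)^{m/2}\exp(-0.5(W-m))$ for $m = 10$ (that is, 9 degrees of freedom). The function name `glr_from_w` is a hypothetical helper, and the two evaluation points 2.7 and 19 are the tabulated $\chi^2_9$ percentiles for $\alpha = 0.05$ quoted in the example that follows.

```python
import math

def glr_from_w(w, m):
    """Evaluate Lambda(W) = (W/m)^(m/2) * exp(-0.5 * (W - m)) for W > 0."""
    return (w / m) ** (m / 2) * math.exp(-0.5 * (w - m))

m = 10
lam_peak = glr_from_w(m, m)      # Lambda attains its maximum at W = m
lam_lower = glr_from_w(2.7, m)   # Lambda at the lower equal-area threshold
lam_upper = glr_from_w(19.0, m)  # Lambda at the upper equal-area threshold
print(lam_peak, lam_lower, lam_upper)
```

The values of $\Lambda$ at the two equal-area thresholds are not the same, which shows that the equal-area rule only approximates the exact condition $\Lambda(t_l) = \Lambda(t_u)$.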
Figure 7.3-8 (plot of $\Lambda$ versus the Chi-square variable $W$) The critical region for $\Lambda$, shown in heavy line along the ordinate, can be related to a
two-sided critical region on $W$ (shown in heavy lines along the abscissa).


Example 7.3-10
(numerical example of testing $H_1: \sigma^2 = \sigma_0^2$ versus $H_2: \sigma^2 \neq \sigma_0^2$) For testing purposes we
draw two sets of Normal random numbers from populations we call P1 and P2, respectively.
The P1 population is $N(1,1)$ while the P2 population is $N(1,4)$. We shall test both
populations for the hypothesis that $\sigma^2 = 1$. The numbers are from RANDOM.ORG available
on the Internet:

N(1,1) [P1]: −0.0644, 2.91, −0.323, 1.21, 2.66, 0.45, 1.26, 0.923, 1.96, 1.62
N(1,4) [P2]: 0.705, 0.685, 0.718, 1.03, 2.52, 1.96, 0.417, 2.69, −1.52, 2.98

From the P1 data we compute $W' = 10.3$. At the 0.05 level of significance the critical region
is the event $\{0 < W < 2.7\} \cup \{19 < W < \infty\}$. Since $W'$ is outside this region, we accept the
hypothesis that the variance of the P1 population is one. We repeat the experiment using
the P2 data. Here we compute $W' = 16.5$; this is still in the acceptance region
so we accept the hypothesis (in error) that the variance of the P2 population is one. We
repeat the experiment at the 0.2 level of significance and find that the critical region is the
event $\{0 < W < 4.17\} \cup \{14.7 < W < \infty\}$. We find that we still accept the hypothesis that
P1 has a variance of one but reject the hypothesis that P2 has a variance of one. There are
two points to be made from this example: (1) Small sample sizes can lead to errors and any
results drawn from them should be viewed with some skepticism; (2) recalling the meaning of $\alpha$,
we see that if this parameter is chosen to be very small, the critical region becomes very
small so that rejection of the hypothesis becomes unlikely.
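The computations of Example 7.3-10 can be reproduced from the listed samples. The sketch below computes $W = \sum_i ((X_i - \hat{\mu})/\sigma_0)^2$ with $\sigma_0^2 = 1$ for both data sets and compares it with the $\chi^2_9$ percentiles quoted in the example; `w_statistic` is a hypothetical helper name, and the thresholds are copied from the text rather than computed.

```python
def w_statistic(samples, sigma0_sq):
    """Return W = sum((x - sample_mean)^2) / sigma0_sq for the given samples."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / sigma0_sq

p1 = [-0.0644, 2.91, -0.323, 1.21, 2.66, 0.45, 1.26, 0.923, 1.96, 1.62]
p2 = [0.705, 0.685, 0.718, 1.03, 2.52, 1.96, 0.417, 2.69, -1.52, 2.98]

w1 = w_statistic(p1, 1.0)
w2 = w_statistic(p2, 1.0)

# Equal-area acceptance intervals for 9 DOF, using the percentiles quoted in the text.
accept_at_05 = [2.7 < w < 19.0 for w in (w1, w2)]   # alpha = 0.05
accept_at_20 = [4.17 < w < 14.7 for w in (w1, w2)]  # alpha = 0.2
print(round(w1, 1), round(w2, 1), accept_at_05, accept_at_20)
```

Raising $\alpha$ from 0.05 to 0.2 widens the critical region, and only then does the P2 sample fall inside it.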


7.4 GOODNESS OF FIT 


An important problem in statistics is to test whether a set of probabilities have a predetermined
set of specified values. For example, suppose we wish to determine whether observed
data come from a standard Normal distribution. Then from the Normal tables of the function
$F_{SN}$ we can compute probabilities of the form $p_i = F_{SN}(x_{i+1}) - F_{SN}(x_i)$, $i = 1,\ldots,l$,
and compare these numbers with data obtained from multiple, independent observations on
an RV $X$. We can test other distributions in the same way, be they discrete or continuous.
The general model is that of sorting the data into $l$ "bins" and comparing for $i = 1,\ldots,l$
the estimated probability $\hat{p}_i$ with the specified probability $p_i$. Typically, if $Y_i$ denotes the
number of outcomes in $n$ trials classified as belonging to "bin" $i$, then $\hat{p}_i = Y_i/n$. If all of the
$\{\hat{p}_i\}$ are close to the corresponding $\{p_i\}$, it is likely that the data come from a population
that has the predetermined probabilities. However, if two or more of the $\hat{p}_i$ are far from the
corresponding $p_i$, we cannot conclude that the tested population has the same parameters
as the assumed one. The choice of the number of "bins," say $l$, for a discrete random variable
with a finite number of outcomes (the elements of the sample space) is typically the number
of outcomes; thus for a die, $l$ would be six, and for a coin, $l$ would be two. When we deal with
continuous random variables, the "bins" become intervals $(x_i, x_{i+1}]$, $i = 1,\ldots,l$, associated
with the $l$ outcomes of the form $\{x_i < X \le x_{i+1}, i = 1,\ldots,l\}$. Now the choice of $l$ requires
more thought. How "refined" a test do we need? A refined test, that is, one that contains
many bins, will typically need far more data than there are bins. Acquiring so much data may
be costly or unrealistic. However, if we choose to make an "unrefined" test, that is, select
a small number of bins, our test will necessarily be coarse. Alternatively, a large number of
bins with insufficient data can lead to gross errors and make our test meaningless.

Such considerations are, more properly, in the province of experimental design and data 
processing. As such they are beyond the scope of the material in this book. 

In the goodness-of-fit test, the hypothesis $H_1$ is that a set of probabilities $\{p_i, i =
1,\ldots,l\}$ satisfies $\{p_i = p_{0i}, i = 1,\ldots,l\}$. The given probabilities $\{p_{0i}, i = 1,\ldots,l\}$ characterize
a probability function such as a distribution function, the outcome probabilities of a
fair die, etc. We make $n$ i.i.d. observations on an RV $X$ and sort them into $l$ bins depending
on their values.

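We define an RV $X_{ij}$ as

$$X_{ij} \triangleq \begin{cases} 1, & \text{if the } j\text{th observation of } X \text{ is in bin } i \\ 0, & \text{else.} \end{cases}$$

We define $P[X_{ij} = 1] \triangleq p_i$ independent of $j$ for $i = 1,\ldots,l$ because of the i.i.d. constraint.
The RVs

$$Y_i \triangleq \sum_{j=1}^{n} X_{ij}, \quad i = 1,\ldots,l,$$

denote the number of outcomes in the bin $i = 1,\ldots,l$ from $n$ trials. Note that $\sum_{i=1}^{l} Y_i = n$
and $\sum_{i=1}^{l} p_i = 1$. The reader will recognize this as the multinomial law discussed in Section
4.8, that is,

$$P[Y_1 = r_1, Y_2 = r_2, \ldots, Y_l = r_l] = P(\mathbf{r}; n, \mathbf{p}) = \frac{n!}{r_1!\,r_2!\cdots r_l!}\,p_1^{r_1}p_2^{r_2}\cdots p_l^{r_l}$$

$$\approx \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{l}\frac{(r_i-np_i)^2}{np_i}\right)}{\sqrt{(2\pi n)^{l-1}p_1p_2\cdots p_l}}, \quad \text{when } n \gg 1. \tag{7.4-1}$$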
The PMF for the $j$th trial is $P_j = \prod_{i=1}^{l} p_i^{x_{ij}}$, where $\sum_{i=1}^{l} x_{ij} = 1$, $\sum_{i=1}^{l} p_i = 1$, and $x_{ij}$ is
restricted to 0 or 1. The likelihood function associated with $n$ repeated trials is

$$L(\mathbf{p}) \triangleq L(p_1,\ldots,p_l) = \prod_{i=1}^{l} p_i^{X_{i1}} \prod_{i=1}^{l} p_i^{X_{i2}} \cdots \prod_{i=1}^{l} p_i^{X_{in}} = \prod_{i=1}^{l} p_i^{Y_i}.$$

Under $H_1: p_i = p_{0i}, i = 1,\ldots,l$, the
local maximum of the likelihood function, $L_{LM}$, is merely

$$L(\mathbf{p}_0) = \prod_{i=1}^{l} p_{0i}^{X_{i1}} \prod_{i=1}^{l} p_{0i}^{X_{i2}} \cdots \prod_{i=1}^{l} p_{0i}^{X_{in}} = \prod_{i=1}^{l} p_{0i}^{Y_i}.$$

The global maximum of the likelihood function is obtained by differentiation
with respect to the $p_i, i = 1,\ldots,l$, while recalling that $\sum_{i=1}^{l} p_i = 1$. The result is $p_i^{\dagger} =
Y_i/n, i = 1,\ldots,l$. Thus, $L_{GM} = L(Y_1/n, Y_2/n, \ldots, Y_l/n) = \prod_{i=1}^{l}(Y_i/n)^{Y_i}$. Finally, recalling
that $\sum_{i=1}^{l} Y_i = n$, we find that the generalized likelihood ratio is

$$\Lambda = n^n \prod_{i=1}^{l}\left(\frac{p_{0i}}{Y_i}\right)^{Y_i} \tag{7.4-2}$$

and the critical region is $0 < \Lambda < \lambda_c$. To compute the critical region at a specified level
of significance, we need the distribution of $\Lambda$. However, the exact distribution of $\Lambda$ under
$H_1$ for an arbitrary value of $n$ is difficult to obtain. It is shown elsewhere [7-1] that $-2\ln\Lambda$
under the large sample assumption is approximately $\chi^2_{l-1}$.
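As a rough illustration of Equation 7.4-2, the sketch below computes $-2\ln\Lambda$ for observed bin counts. It works with logarithms to avoid the very large and very small numbers in $n^n\prod_i (p_{0i}/Y_i)^{Y_i}$, and it treats empty bins as contributing a factor of one (a convention assumed here). The function name `neg2_log_glr` is hypothetical, and the counts 61 and 39 are the coin-flip counts used in Example 7.4-1.

```python
import math

def neg2_log_glr(counts, p0):
    """Return -2 ln(Lambda) for Lambda = n^n * prod_i (p0_i / Y_i)^(Y_i).

    Because sum(Y_i) = n, ln(Lambda) = sum_i Y_i * ln(n * p0_i / Y_i).
    Bins with Y_i = 0 are skipped (they contribute a factor of 1).
    """
    n = sum(counts)
    log_lam = sum(y * math.log(n * p / y) for y, p in zip(counts, p0) if y > 0)
    return -2.0 * log_lam

g = neg2_log_glr([61, 39], [0.5, 0.5])
print(round(g, 2))
```

With one degree of freedom and $\alpha = 0.05$, the value can be compared with the $\chi^2_1$ percentile 3.84 quoted in Example 7.4-1.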

We consider here another approach. From Equation 7.4-1 we see that the $Y_i, i = 1,\ldots,l$,
under the large sample assumption can be approximated by Normal RVs $N(np_i, np_i), i =
1,\ldots,l$, while the $U_i \triangleq (Y_i - np_i)/\sqrt{np_i}, i = 1,\ldots,l$, are approximately standard Normal. Now consider
the test statistic

$$V \triangleq \sum_{i=1}^{l}\left(\frac{Y_i - np_{0i}}{\sqrt{np_{0i}}}\right)^2, \tag{7.4-3}$$

which is called the Pearson test statistic, and accepting or rejecting a hypothesis based on
the size of $V$ is called Pearson's test or the Chi-square test [7-16] to [7-20]. Pearson's test
statistic has the form of a $\chi^2$ RV with $l$ degrees of freedom but, in fact, has only $l-1$ degrees
of freedom because $Y_l = n - \sum_{i=1}^{l-1} Y_i$ is completely specified once the $Y_1, Y_2, \ldots, Y_{l-1}$ are
specified. Now, if the $Y_i$ come from a population with probabilities $p_{0i}, i = 1,\ldots,l$, we
expect that a realization of $V$ will be small. However, if the $Y_i$ come from a population
with probabilities $p_{1i}, i = 1,\ldots,l$, where at least two of the $p_{1i}$ are significantly different from
the corresponding $p_{0i}$, we expect realizations of $V$ to be large. We can demonstrate this by
computing $E[V]$ under $H_1$ and $H_2$. Under $H_1$ we compute $E[V|H_1] = l-1$ (see Problem
7.24). However, under $H_2$ we compute $E[V|H_2]$ as

$$E[V|H_2] \approx \sum_{i=1}^{l}(p_{0i})^{-1}n(p_{1i}-p_{0i})^2 \tag{7.4-4}$$

when $n$ is large (see Problem 7.25). Clearly $E[V|H_2]$ can become arbitrarily larger than
$l-1$ when at least some of the $p_{1i}$ are different from $p_{0i}$. An exact computation of $E[V|H_2]$
would show that it can never be smaller than $l-1$.
Returning to the test statistic in Equation 7.4-3, that is,

$$V \triangleq \sum_{i=1}^{l}\left(\frac{Y_i - np_{0i}}{\sqrt{np_{0i}}}\right)^2,$$

we note that under $H_1$ it is (approximately, for large $n$) $\chi^2_{l-1}$. To find the constant $c$ that determines the critical region
$\{V > c\}$ at significance $\alpha$, we solve $\int_c^{\infty} f_{\chi^2}(x; l-1)\,dx = \alpha$ or, equivalently, $1 - \alpha =
F_{\chi^2}(c; l-1)$. So we find that $c = x_{1-\alpha}$, the $1-\alpha$ percentile point of $\chi^2_{l-1}$. Thus our criterion
becomes: accept $H_1$ if $V < x_{1-\alpha}$, else reject $H_1$.


Example 7.4-1
(fairness of a coin) We wish to test the hypothesis $H_1$ that $p_{01} = P[\text{heads}] = 0.5 = p_{02} =
P[\text{tails}]$ at a level of significance $\alpha = 0.05$. We flip the coin 100 times and observe 61 heads
and 39 tails. Then from

$$V \triangleq \sum_{i=1}^{2}\left(\frac{Y_i - np_{0i}}{\sqrt{np_{0i}}}\right)^2$$

we obtain $V' = \frac{1}{0.5 \times 100}[61-50]^2 + \frac{1}{0.5 \times 100}[39-50]^2 = 4.84$. We compute the critical
value from $0.95 = F_{\chi^2}(x_{0.95}; 1)$, which yields $x_{0.95} = 3.84$. Since $V' = 4.84 > 3.84$ we reject
the hypothesis that the coin is fair.


Example 7.4-2
(fairness of a die) We wish to test the hypothesis, at significance 0.05, that a six-faced die
is fair. We let $Y_i, i = 1,\ldots,6$, denote the number of times face $i$ shows up. We cast the
die 1000 times and observe $Y_1' = 152$, $Y_2' = 175$, $Y_3' = 165$, $Y_4' = 180$, $Y_5' = 159$, $Y_6' = 171$.
Then, with $np_{0i} \approx 167$,

$$V' = \frac{1}{167}\left[(167-152)^2 + (167-175)^2 + (167-165)^2 + (167-180)^2 + (167-159)^2 + (167-171)^2\right] = 3.25.$$

The degree of freedom is five so we solve $0.95 = F_{\chi^2}(x_{0.95}; 5)$. This yields $x_{0.95} = 11.1$ and
since $3.25 < 11.1$ we accept the hypothesis that the die is fair.
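Both of the preceding examples apply the same formula, so a single helper covers them. The sketch below is a minimal implementation of the Pearson statistic of Equation 7.4-3 that takes the expected counts $np_{0i}$ explicitly; `pearson_v` is a hypothetical name, the die example uses the rounded expected count 167 as in the text, and the critical values 3.84 and 11.1 are the table values quoted above.

```python
def pearson_v(counts, expected):
    """Return V = sum_i (Y_i - n*p0_i)^2 / (n*p0_i), given observed and expected counts."""
    return sum((y - e) ** 2 / e for y, e in zip(counts, expected))

# Example 7.4-1: 100 coin flips, 61 heads and 39 tails, fair-coin expectation 50/50.
v_coin = pearson_v([61, 39], [50, 50])

# Example 7.4-2: die counts with the rounded expected count 167 used in the text.
v_die = pearson_v([152, 175, 165, 180, 159, 171], [167] * 6)

reject_coin = v_coin > 3.84   # chi-square percentile for 1 DOF at alpha = 0.05
reject_die = v_die > 11.1     # chi-square percentile for 5 DOF at alpha = 0.05
print(round(v_coin, 2), round(v_die, 2), reject_coin, reject_die)
```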


Example 7.4-3
(test of Normality) We wish to determine whether data are from a standard Normal $N(0,1)$
population. We let $H_1$ be the hypothesis that $X$ is distributed as a standard Normal
$N(0,1)$ and $H_2$ be the alternative that $X$ is not distributed as $N(0,1)$. We use differences
of the cumulative Normal distribution for the $\{p_{0i}\}$ as follows:

$p_{01} \triangleq F_{SN}(-2.0) = 0.023$; $p_{02} \triangleq F_{SN}(-1.5) - F_{SN}(-2.0) = 0.044$; $p_{03} \triangleq F_{SN}(-1.0) -
F_{SN}(-1.5) = 0.092$; $p_{04} \triangleq F_{SN}(-0.5) - F_{SN}(-1.0) = 0.15$; $p_{05} \triangleq F_{SN}(0) - F_{SN}(-0.5) =
0.1915$; $p_{06} \triangleq F_{SN}(0.5) - F_{SN}(0) = 0.1915$; $p_{07} \triangleq F_{SN}(1.0) - F_{SN}(0.5) = 0.15$; $p_{08} \triangleq
F_{SN}(1.5) - F_{SN}(1.0) = 0.092$; $p_{09} \triangleq F_{SN}(2.0) - F_{SN}(1.5) = 0.044$; $p_{0,10} \triangleq F_{SN}(\infty) -
F_{SN}(2) = 0.023$.
In 1000 observations we observe the following realizations:

in the interval $(-2, -1.5]$: $Y_2' = 42$
in the interval $(-1.5, -1]$: $Y_3' = 96$
in the interval $(-1, -0.5]$: $Y_4' = 135$
in the interval $(-0.5, 0]$: $Y_5' = 202$
in the interval $(0, 0.5]$: $Y_6' = 193$

in the interval $(1, 1.5]$: $Y_8' = 72$
in the interval $(1.5, 2]$: $Y_9' = 53$

We use

$$V \triangleq \sum_{i=1}^{10}\left(\frac{Y_i - 1000p_{0i}}{\sqrt{1000p_{0i}}}\right)^2$$

as the test statistic and observe that $V$ is $\chi^2_9$ if $H_1$ is
true. From the given data we compute $V' = 12.9$. Since $x_{0.95} = 16.92$ and 12.9 is less than
16.92 we accept the hypothesis that the data are Normally distributed.
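The bin probabilities $p_{0i}$ of Example 7.4-3 need not be read from tables. The sketch below computes them as differences of the standard Normal CDF, written in terms of the error function from Python's standard library; the helper name `std_normal_cdf` and the variable names are assumptions made for illustration.

```python
import math

def std_normal_cdf(x):
    """Standard Normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Interior bin edges used in the example; the outer bins extend to -inf and +inf.
edges = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
cdf_values = [0.0] + [std_normal_cdf(e) for e in edges] + [1.0]

# p0[i] is the probability of bin i+1 under N(0,1).
p0 = [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]
expected_counts = [1000 * p for p in p0]
print([round(p, 4) for p in p0])
```

The expected counts $1000p_{0i}$ are the quantities that enter the denominator of the Pearson statistic for this example.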


We can use Pearson's test statistic to test whether two unknown probabilities are equal
even if no other prior information such as means and variances is available. For example,
we test two brands of printing paper in printing presses: Brand A clogs the presses six
times in 150 trials while brand B clogs the presses 25 times in 550 trials. Are brands A
and B equally likely to clog the presses? Two speech recognition programs are available
for purchase. Assuming the same speaker, we find that speech recognition program SR1
mistakes 61 words out of 250 while SR2 mistakes 30 words out of 110. Are both programs
equally effective? In the framework of probability theory we model this as follows: We
consider the occurrences of two events, say $E_1$ and $E_2$, and we ask, Is $P[E_1] = P[E_2]$? Define
$Z_1$ as the number of times we observe the occurrence of $E_1$ in $m$ trials and $Z_2$ as the
number of times we observe the occurrence of $E_2$ in $n$ subsequent trials. Let $p_1 \triangleq P[E_1]$ and
$p_2 \triangleq P[E_2]$, with $q_i \triangleq 1 - p_i$. Let $m \gg 1$, $n \gg 1$; then by the Central Limit Theorem $Z_1 : N(mp_1, mp_1q_1)$
and $Z_2 : N(np_2, np_2q_2)$. We define the normalized RVs $Y_1 \triangleq Z_1/m : N(p_1, p_1q_1/m)$ and
$Y_2 \triangleq Z_2/n : N(p_2, p_2q_2/n)$ and consider the RV $Y \triangleq Y_1 - Y_2$. Since $Y_1$ and $Y_2$ are independent
(recall that $Y_1$ results from observations in the first $m$ trials while $Y_2$ results from observations
in the next $n$ trials), it follows that $Y$ is Normal with mean $p_1 - p_2$ and variance $\sigma_Y^2 =
(np_1q_1 + mp_2q_2)/mn$. Let $H_1$ be the hypothesis that $p_1 = p_2$ and the alternative $H_2$ be that
$p_1 \neq p_2$; clearly under $H_1$, $Y : N(0, p_1q_1(m+n)/nm)$. The Pearson test statistic adapted to
this problem is

$$V \triangleq \left(\frac{Y - (p_1 - p_2)}{\sigma_Y}\right)^2,$$

which is seen to be $\chi^2_1$. For a test of significance $\alpha$ we find the percentile $x_{1-\alpha}$ in $1 -
\alpha = F_{\chi^2}(x_{1-\alpha}; 1)$ such that if $V < x_{1-\alpha}$ we accept the hypothesis; else we reject the
hypothesis.
The difficulty with this problem is that $\sigma_Y$ is unknown since $p_1$ and $p_2$ are unknown. One
way out of this difficulty is to replace $\sigma_Y$ by an estimate of $\sigma_Y$ based on our observations.
Under $H_1$, $p_1 = p_2 = p$, and the minimum variance, unbiased estimator of $p$ is $\hat{p} = (Z_1 +
Z_2)/(m+n)$. It follows that under $H_1$, $\hat{\sigma}_Y = \sqrt{\hat{p}\hat{q}(m+n)/mn}$, where $\hat{q} \triangleq 1 - \hat{p}$. We illustrate
with two examples.


Example 7.4-4 
(voting patterns in different regions) In the Governor’s race in a large state, exit polls 
showed that in a rural upstate county 167 out of 211 voters voted for the Republican 
while in a downstate county that includes a large metropolitan area, 216 out of 499 voters 
voted Republican. Can we assume that the probability, $p_1$, that an upstate voter will vote
Republican is the same as $p_2$, that of a downstate voter?


Solution Under $H_1$, $p_1 = p_2 \triangleq p$, while under $H_2$, $p_1 \neq p_2$. Under $H_1$, we compute
$\hat{p}' = 383/710 = 0.54$, $\hat{q}' = 0.46$, $\hat{\sigma}_Y' = 0.041$, $Y_1' = 167/211 = 0.79$, $Y_2' = 216/499 = 0.43$,
and $Y' \triangleq Y_1' - Y_2' = 0.36$; hence $V' = (0.36/0.041)^2 = 77$. At a significance level of $\alpha = 0.05$,
we find that $x_{0.95} = 3.84$. Since $77 > 3.84$, the hypothesis is strongly rejected.


Example 7.4-5 
(interpretation of scientific data) In an attempt to find out whether Rhesus monkeys can 
be made to distinguish and possibly attach meaning to different sounds, including spoken 
language, the following experiment was performed. A Rhesus monkey was put in an anechoic 
(external-soundproof) chamber with a computer-controlled directional loudspeaker that 
randomly emitted bursts of one of two signals: S1, a sound of the type that the Rhesus
monkey might hear in its natural habitat; and S2, a sound characteristic of a spoken word.
If the monkey, upon hearing a sound burst, turned its head toward the loudspeaker, it was
taken to mean that the monkey was reacting to the sound. If the sound was of an S2 type,
it could mean that the monkey was curious or interested in the sound and could possibly
be trained to accept the sound as a word. However, if the monkey showed no reaction to
the sound, it was taken to mean that the monkey attached no significance to it. From the
researcher's point of view the ideal case would be if the monkey never turned its head when
exposed to an S1 sound and always turned its head when exposed to an S2 sound. Then the
researcher could write a scholarly paper on the cognitive abilities of the Rhesus monkey and
become famous.† We shall ignore the perplexing problem of deciding whether the monkey's
head has rotated enough to be scored as a "turned head."‡

In 267 bursts of "natural habitat"-type sounds, the monkey turned its head 112 times;
in 289 bursts of spoken word sounds, the Rhesus monkey turned its head 173 times. Let $p_1$
denote the probability that a monkey will turn its head upon hearing a "natural habitat"
sound and $p_2$ denote the probability that the monkey will turn its head upon hearing spoken-word
sounds. Under $H_1$, $p_1 = p_2 = p$ while under $H_2$, $p_1 \neq p_2$. Can we accept the hypothesis
that the monkey shows no differentiation in its reaction to the two sounds, that is, that
$H_1: p_1 = p_2 \triangleq p$ is true?

Solution Under $H_1$, $\hat{p}' = 0.51$, $\hat{q}' = 0.49$, $\hat{\sigma}_Y' = 0.0424$, $Y_1' = 112/267 = 0.42$, $Y_2' =
173/289 = 0.6$, and $Y' = Y_1' - Y_2' = -0.18$; hence $V' = (-0.18/0.0424)^2 = 18$; at the 0.05
level of significance $x_{0.95} = 3.84$. Hence the hypothesis is strongly rejected.

†This research is being done at a major university but the results have generated controversy in the
scientific community.

‡A problem similar to the "checked swing" problem in baseball, where the umpire must decide whether
a batter "followed through" or "checked his swing."
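The two preceding examples follow the same recipe: pool the two samples to estimate $p$ under $H_1$, estimate $\sigma_Y$ from the pooled value, and square the normalized difference. A minimal sketch of that recipe is given below; `two_proportion_v` is a hypothetical helper name. Because it carries full precision instead of the rounded intermediate values used in the text, its results differ slightly from the quoted 77 and 18.

```python
def two_proportion_v(z1, m, z2, n):
    """Pearson statistic for H1: p1 = p2, with sigma_Y estimated from pooled data.

    z1 successes in m trials, z2 successes in n trials.
    """
    p_hat = (z1 + z2) / (m + n)                       # pooled estimate of p
    var_y = p_hat * (1.0 - p_hat) * (m + n) / (m * n)  # estimated variance of Y
    y = z1 / m - z2 / n                                # observed difference Y1 - Y2
    return y * y / var_y

v_votes = two_proportion_v(167, 211, 216, 499)    # Example 7.4-4
v_monkey = two_proportion_v(112, 267, 173, 289)   # Example 7.4-5

# chi-square percentile for 1 DOF at alpha = 0.05, as quoted in the text
reject_votes = v_votes > 3.84
reject_monkey = v_monkey > 3.84
print(round(v_votes, 1), round(v_monkey, 1))
```

The sign of $Y_1 - Y_2$ does not matter for the decision because the statistic depends only on its square.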


7.5 ORDERING, PERCENTILES, AND RANK 


For the reader's convenience we repeat here some of the material from Section 6.8 of
Chapter 6. We make $n$ i.i.d. observations on a generic RV $X$ (sometimes called a population)
with CDF $F_X(x)$ to obtain the sample $X_1, X_2, \ldots, X_n$. The joint pdf of the sample is
$f_X(x_1) \times \cdots \times f_X(x_n)$, $-\infty < x_i < \infty$, $i = 1,\ldots,n$. Next we order the $X_i, i = 1,\ldots,n$, by size
(signed magnitude) to obtain the ordered sample $Y_1, Y_2, \ldots, Y_n$ such that $-\infty < Y_1 < Y_2 <
\cdots < Y_n < \infty$. This is sometimes called the order statistics of the observations on $X$. When
ordered, the sequence 3, −2, −9, 4 would become −9, −2, 3, 4. If a sequence $X_1, \ldots, X_{20}$ was
generated from $n = 20$ observations on $X : N(0,1)$, it would be very unlikely that $Y_1 > 0$ because
this would require that the other 19 $Y_i, i = 2,\ldots,20$, be greater than zero and therefore all
the samples would be on the positive side of the Normal curve. The probability of this event
is $(1/2)^{20}$. Likewise it would be extremely unlikely that $Y_{20} < 0$ because this would require
that the other 19 $Y_i, i = 1,\ldots,19$, be less than zero. As shown in Section 5.3, the joint pdf of
the ordered sample $Y_1, Y_2, \ldots, Y_n$ is $n!\,f_X(y_1) \times \cdots \times f_X(y_n)$, $-\infty < y_1 < y_2 < \cdots < y_n < \infty$,
and zero else. Ordering and ranking are not the same in that ranking normally assigns a 
value to the ordered elements. For example most people would order the pain of a broken 
bone higher than that of a sore throat due to a cold. But if a physician asked the patient 
to rank these pains on a scale of 0 to 10, the pain associated with the broken bone might 
be ranked at 8 or 9 while the sore throat might be given a rank of 3 or 4. 
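The ordering step and the probability quoted above are easy to check. The following sketch orders the four-element sequence from the text by signed magnitude and evaluates the probability that all 20 observations from a zero-mean Normal population are positive; the variable names are illustrative only.

```python
# Ordering by signed magnitude: the order statistics of the sample.
sample = [3, -2, -9, 4]
ordered = sorted(sample)

# For a zero-mean Normal population each observation is positive with
# probability 1/2, so all 20 i.i.d. observations are positive (Y_1 > 0)
# with probability (1/2)^20.
p_all_positive = 0.5 ** 20
print(ordered, p_all_positive)
```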

Consider next the idea of percentiles. We have used the notion of percentiles in other 
places in the book; here we briefly discuss it in greater detail. Assume that the IQ of a large 
segment of a select population is distributed as N(100, 100), that is, a mean of 100 and a 
standard deviation of 10. Obviously the Normal approximation is valid only over a limited 
range because no one has an IQ of 1000 or an IQ of —10. The IQ test itself is valid only over 
a limited range and may not give an accurate score for people that are extremely bright 
or severely cognitively handicapped. It is sometimes said that people in either group are 
“off the IQ scale.” Still the IQ test is widely used as an indicator of problem-solving ability. 
Suppose that the result of an IQ test says that the child ranks in the 93rd percentile of the 
examinees and therefore qualifies for admission to programs for the “gifted.” How do we 
locate the 93rd percentile on the IQ scale? 


Definition (percentile): Given an RV $X$ with CDF $F_X(x)$, the $u$-percentile of $X$
is the number $x_u$ such that $F_X(x_u) = u$. If the CDF $F_X$ is everywhere continuous with
continuous derivative, then $x_u = F_X^{-1}(u)$, where the function $F_X^{-1}$ is the inverse function
associated with the CDF $F_X$, that is, $F_X^{-1}(F_X(x_u)) = x_u$. The standard Normal CDF and
its inverse are shown in Figure 7.5-1.

Figure 7.5-1 (a) The standard Normal CDF, $u = F_X(x)$; (b) the inverse function.


Observation In the special case of the standard Normal, where $Z : N(0,1)$, we use the
symbol $z_u$ to denote the $u$-percentile of $Z$. If $X : N(\mu,\sigma^2)$, then the $u$-percentile of $X$, $x_u$,
is related to $z_u$ according to

$$x_u = \mu + z_u\sigma. \tag{7.5-1}$$


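Example 7.5-1
(relation between $x_u$ and $z_u$) Show that $x_u = \mu + z_u\sigma$.


Solution We write

$$u = F_X(x_u) = (2\pi\sigma^2)^{-1/2}\int_{-\infty}^{x_u}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)dx$$

$$= (2\pi)^{-1/2}\int_{-\infty}^{(x_u-\mu)/\sigma}\exp\left(-\frac{1}{2}z^2\right)dz$$

$$\triangleq (2\pi)^{-1/2}\int_{-\infty}^{z_u}\exp\left(-\frac{1}{2}z^2\right)dz.$$

The last line is the CDF of $Z : N(0,1)$. Hence $x_u = \mu + z_u\sigma$. We can use this result in the
previously mentioned IQ problem. From the data we have $F_X(x_u) = 0.93 = F_Z(z_u)$. From
the table of $F_{SN}$, we get that $z_u \approx 1.48$. Then with $x_u = \mu + z_u\sigma = 100 + 1.48(10)$, we
get that the 93rd percentile in the IQ distribution corresponds to an IQ of 115.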
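The table lookup in this example can be replaced by the inverse Normal CDF available in Python's standard library (`statistics.NormalDist`, Python 3.8 or later). The sketch below computes $z_u$ for $u = 0.93$, applies Equation 7.5-1 with $\mu = 100$ and $\sigma = 10$, and cross-checks against the inverse CDF of $N(100, 100)$ evaluated directly; the variable names are illustrative.

```python
from statistics import NormalDist

u = 0.93
z_u = NormalDist().inv_cdf(u)        # u-percentile of the standard Normal
x_u = 100 + z_u * 10                 # Equation 7.5-1 with mu = 100, sigma = 10

# Cross-check: invert the CDF of N(100, 10^2) directly.
x_direct = NormalDist(mu=100, sigma=10).inv_cdf(u)
print(round(z_u, 2), round(x_u))
```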
How Ordering is Useful in Estimating Percentiles and the Median*


We briefly review here some of the material of Section 6.8 that is associated with percentiles 
and the median. 

The median of the population $X$ is the point $x_{0.5}$ such that $F_X(x_{0.5}) = 0.5$. This is to
be contrasted with the mean of $X$, written as $\mu_X$, and defined as $\mu_X \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx$. The
median and mean do not necessarily coincide. For example, in the case of $f_X(x) = \lambda e^{-\lambda x}u(x)$
we find that $\mu_X = 1/\lambda$ but $x_{0.5} = 0.69/\lambda$. To compute the mean of $X$ we need $f_X(x)$, which
is often not known. The mean may seem like a rather abstract parameter while the median
is merely the point that divides the population in half, that is, half the population is at
or below the median and half above. The situation where $f_X(x)$ is assumed to exist and
for which we can extract or estimate parameters is called the parametric case. Typically, in
the parametric case, we might assume a form for the population density, for example, the
Normal, and wish to estimate some unknown parameter of the distribution, for example,
the mean $\mu_X$. Then given $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on $X$, we estimate $\mu_X$ with
$\hat{\mu}_X = n^{-1}\sum_{i=1}^{n} X_i$, which happens to be an unbiased and consistent estimator for the mean
of many populations. Indeed it is the simple form of the mean estimator function $\hat{\mu}_X$ and
the fact that if $\sigma_X^2$ is finite then $\hat{\mu}_X \to \mu_X$ for large $n$ (see the law of large numbers) that
make the mean so useful in many applications. The estimation of parameters in known
or assumed distributions and other operations, for example, hypothesis testing involving
known or assumed distributions, is known as parametric statistics.

The estimation of the properties and parameters of a population without any assump- 
tions on the form or knowledge of the population distribution is known as distribution-free, 
robust, or nonparametric statistics. Statistics based on observations only without assuming 
underlying distributions are robust in the sense that the theorems and conclusions drawn 
from the observations do not change with the form of the underlying distributions. Whereas 
the mean and standard deviation are useful in characterizing the center and dispersion of a 
population in the parametric case, the median and range play this role in the nonparametric 
case. To estimate the median from $X_1, X_2, \ldots, X_n$, we use the order statistics and estimate
$x_{0.5}$ with the sample median estimator

$$\hat{Y}_{0.5} = \begin{cases} Y_{k+1}, & \text{if } n \text{ is odd, that is, } n = 2k+1,\\ 0.5\,(Y_k + Y_{k+1}), & \text{if } n \text{ is even, that is, } n = 2k. \end{cases} \tag{7.5-2}$$

The sample median is not an unbiased estimator for $x_{0.5}$ but becomes nearly so when $n$ is
large. The dispersion in the nonparametric case is measured from the 50 percent percentile
range, that is, $\Delta x_{0.50} = x_{0.75} - x_{0.25}$, or the 90 percent percentile range, that is, $\Delta x_{0.90} =
x_{0.95} - x_{0.05}$, or some other appropriate range.
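
The sample median estimator of Equation 7.5-2 can be sketched in a few lines. The function name `sample_median` is a hypothetical helper introduced only for illustration; the code simply sorts the observations to obtain the order statistics and then applies the odd/even rule.

```python
def sample_median(observations):
    """Sample median per Equation 7.5-2, computed from the order statistics."""
    y = sorted(observations)  # order statistics Y_1 <= ... <= Y_n (0-based list)
    n = len(y)
    if n == 0:
        raise ValueError("need at least one observation")
    k = n // 2
    if n % 2 == 1:
        return y[k]                   # n = 2k + 1: the middle element Y_{k+1}
    return 0.5 * (y[k - 1] + y[k])    # n = 2k: average of Y_k and Y_{k+1}

print(sample_median([3, 1, 2]))       # odd n
print(sample_median([4, 1, 3, 2]))    # even n
```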


* Readers familiar with the contents of Section 6.8 can skip this subsection.

†Thus it is not wholly accurate to say that “half the population is below and half above” the median.
Moreover the reader should be aware that the median of a sample is typically not the same as the median
of the whole population.


Figure 7.5-2 Estimated percentile range from ten ordered samples $y_1, y_2, \ldots, y_{10}$ showing linear
interpolation between the samples. To get the estimated percentile take the ordinate value and multiply
by 100/11. Thus, to a first approximation, the 90th percentile is estimated from $y_{10}$ while the 9th
percentile is estimated from $y_1$. An approximate 50 percent range is covered by $y_8 - y_3$.


Example 7.5-2 
(interpolation to get percentile points) Using the symbol $\alpha \sim \beta$ to mean $\alpha$ estimates $\beta$, we
have $Y_3 \sim x_{0.273}$, $Y_4 \sim x_{0.364}$, and using linear interpolation, we get $x_{0.3}$ as

$$Y_4 + \frac{(Y_4 - Y_3)(0.3 - 4/11)}{1/11} \sim x_{0.3}.$$

Linear interpolation between ordered samples is illustrated in Figure 7.5-2.
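
The interpolation rule of this example generalizes to any $u$ between $1/(n+1)$ and $n/(n+1)$. The sketch below assumes, as in the example, that $Y_k$ estimates $x_{k/(n+1)}$; the function name `interpolated_percentile` is our own and is not taken from the text.

```python
def interpolated_percentile(observations, u):
    """Estimate x_u by linear interpolation between adjacent order statistics,
    assuming Y_k estimates the k/(n+1) percentile point."""
    y = sorted(observations)
    n = len(y)
    pos = u * (n + 1)                 # fractional (1-based) position of x_u
    if not 1 <= pos <= n:
        raise ValueError("u lies outside the range covered by the sample")
    k = int(pos)                      # lower neighbour Y_k
    if k == n:
        return y[-1]
    frac = pos - k                    # fraction of the way from Y_k to Y_{k+1}
    return y[k - 1] + frac * (y[k] - y[k - 1])

# Ten ordered samples 1, 2, ..., 10: x_0.3 lies 0.3 of the way from Y_3 to Y_4
print(interpolated_percentile(range(1, 11), 0.3))
```

Starting from $Y_3$ and moving forward, as the code does, is algebraically the same as starting from $Y_4$ and moving back by $(0.3 - 4/11)/(1/11)$, which is the form used in the example.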


We discuss next a fundamental result connecting order statistics with percentiles. Once
again the model is that of collecting a sample of $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on an
RV $X$ with CDF $F_X(x)$. We recall the notation $P[X_i \le x_u] \triangleq u$. Next we consider the
order statistics $Y_1 < Y_2 < \cdots < Y_n$. Now consider the event $\{Y_k < x_u\}$. Since $Y_k$ is the $k$th
element in the ordering of the $\{X_i\}$, there are at least $k$ of the $\{X_i\}$ that are less than $x_u$.
There may be more but certainly not fewer. Then, because the $\{X_i\}$ are i.i.d., we can use the
binomial probability formula to compute

$$P[Y_k < x_u] = P[\text{at least } k \text{ of the } \{X_i\} \text{ are less than } x_u]
= \sum_{i=k}^{n}\binom{n}{i}u^i(1-u)^{n-i}. \tag{7.5-3}$$

Next consider the event $\{Y_{k+r} > x_u\}$. Since $Y_{k+r}$ is the $(k+r)$th element in the ordering of
the $\{X_i\}$, there are at least $n-(k+r)+1$ of the $\{X_i\}$ that are greater than $x_u$. Equivalently,
there can be no more than $k+r-1$ of the $\{X_i\}$ less than $x_u$. Then

$$P[Y_{k+r} > x_u] = P[\text{no more than } k+r-1 \text{ of the } \{X_i\} \text{ are less than } x_u]
= \sum_{i=0}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i}. \tag{7.5-4}$$

The intersection of the events, $\{Y_{k+r} > x_u\} \cap \{Y_k < x_u\}$, is the event $\{Y_k < x_u < Y_{k+r}\}$. Its
probability is

$$P[Y_k < x_u < Y_{k+r}] = \sum_{i=k}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i} \tag{7.5-5}$$

and is independent of $f_X(x)$. The result given in Equation 7.5-5 is one of the major results
of nonparametric statistics and has important applications, for example, estimating the
median of a population, as we illustrate below.
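
Equation 7.5-5 is a finite binomial sum and is easy to evaluate by machine. The following sketch uses only the standard library; the function name `coverage_probability` and its argument order are assumptions made for illustration.

```python
from math import comb

def coverage_probability(n, k, r, u):
    """P[Y_k < x_u < Y_{k+r}] from Equation 7.5-5 (requires k >= 1, k + r <= n)."""
    return sum(comb(n, i) * u**i * (1 - u)**(n - i) for i in range(k, k + r))

# Probability that [Y_1, Y_5] covers the median when n = 5
print(coverage_probability(5, 1, 4, 0.5))
```

Note that the population density never enters the computation: only $n$, the two ranks, and $u$ are needed, which is the distribution-free property stated above.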


Example 7.5-3 
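(How large a sample do we need to cover the median at 95 percent confidence?) We seek the
end points $Y_1, Y_n$ of a random interval $[Y_1, Y_n]$ so that the event $\{Y_1 < x_{0.5} < Y_n\}$ occurs
with probability 0.95. Here $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$. In effect,
how large should $n$ be?


Solution We compute

$$P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1}\binom{n}{i}\left(\frac{1}{2}\right)^n \approx 0.95$$

and find that for $n = 5$, $P[Y_1 < x_{0.5} < Y_5] \approx 0.94$. For $n = 6$ the probability rises to about
0.97, so $n = 6$ is the smallest sample size that meets the 0.95 level strictly. The probability
that the random interval $[Y_1, Y_n]$ covers the 50 percent percentile point is shown in
Figure 7.5-3 for various values of $n$.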
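
The search over $n$ in this example can be automated. The sketch below evaluates the binomial sum for the interval $[Y_1, Y_n]$ and steps $n$ upward until a requested level is reached; both function names are hypothetical.

```python
from math import comb

def extremes_cover_median(n):
    """P[Y_1 < x_0.5 < Y_n] = sum_{i=1}^{n-1} C(n, i) (1/2)^n."""
    return sum(comb(n, i) for i in range(1, n)) / 2**n

def smallest_sample_size(level):
    """Smallest n for which [Y_1, Y_n] covers the median with probability >= level."""
    n = 2
    while extremes_cover_median(n) < level:
        n += 1
    return n

for n in range(2, 9):
    print(n, round(extremes_cover_median(n), 4))
print(smallest_sample_size(0.95))
```

Since the sum omits only the terms $i = 0$ and $i = n$, it equals $1 - 2(1/2)^n$, which gives a handy cross-check on the loop.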

Example 7.5-4 
(most probable adjacent ordered pair to cover $x_{0.33}$) We have the order statistics $\{Y_1, Y_2, \ldots,
Y_n\}$ and wish to find the pair $\{Y_i, Y_{i+1}, i = 1, \ldots, n-1\}$ that maximizes the probability
of covering the 33.33rd percentile point. The 33.33rd percentile point $x_{0.33}$ is defined by
$1/3 = F_X(x_{0.33})$. For specificity we assume $n = 10$. From Equation 7.5-5 we compute

$$P[Y_k < x_{0.33} < Y_{k+1}] = \frac{10!}{k!\,(10-k)!}\,(1/3)^k(2/3)^{10-k}, \quad k = 1, \ldots, 9,$$

and plot the result in Figure 7.5-4. Clearly the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
The probability of the event $\{Y_3 < x_{0.33} < Y_4\}$ is 0.26.
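
The nine probabilities of this example can be tabulated directly, which also confirms the location of the maximum. This is a minimal sketch; the function name `adjacent_pair_probabilities` is ours.

```python
from math import comb

def adjacent_pair_probabilities(n, u):
    """Map k -> P[Y_k < x_u < Y_{k+1}] for k = 1, ..., n-1 (Equation 7.5-5, r = 1)."""
    return {k: comb(n, k) * u**k * (1 - u)**(n - k) for k in range(1, n)}

probs = adjacent_pair_probabilities(10, 1 / 3)
best_k = max(probs, key=probs.get)    # rank k of the most probable covering pair
for k, p in probs.items():
    print(k, round(p, 3))
print("most probable pair starts at k =", best_k)
```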


Figure 7.5-3 Probability that the random interval $[Y_1, Y_n]$ covers the median for various values of $n$.


Figure 7.5-4 Among the pairwise intervals $[Y_k, Y_{k+1}]$, the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
Here $n = 10$. (Vertical axis: probability that the 33rd percentile point is covered by the $k$th adjacent
ordered pair; horizontal axis: $k$th ordered pair, $k = 1, \ldots, 9$.)


Example 7.5-5 
(the median and mean are not the same for the binomial) We make the somewhat trivial 
observation that for the binomial case the mean and median do not coincide. For example 
with p = 1/2 and n = 4, the mean is 2 but the median, such as it is, is somewhere between 1 
and 2. However, when n is large the median and mean approach each other and the median 
can be estimated by the mean. Indeed it can be shown that the error between the mean
and median is proportional to $(p(1-p))^n$, which becomes arbitrarily small as $n \to \infty$.


Confidence Interval for the Median When n Is Large 


If $n$ is large enough so that the Normal approximation to the binomial is valid in distribution,
we can use

$$P[\alpha \le S_n \le \beta] = \sum_{i=\alpha}^{\beta}\binom{n}{i}p^i(1-p)^{n-i}
\approx \frac{1}{\sqrt{2\pi}}\int_{\alpha'}^{\beta'}\exp\left(-\frac{1}{2}y^2\right)dy,$$

where $S_n$ is the number of successes in $n$ Bernoulli trials with success probability $p$ and

$$\alpha' \triangleq \frac{\alpha - np - 0.5}{\sqrt{np(1-p)}}, \qquad
\beta' \triangleq \frac{\beta - np + 0.5}{\sqrt{np(1-p)}}. \tag{7.5-6}$$

To apply these results to the problem at hand we write

$$P[Y_r < x_{0.5} < Y_{n-r+1}] = \sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n, \tag{7.5-7}$$

where we used that, by definition of the median, $u = F_X(x_{0.5}) = 1/2$. The choice of
subscripts will ensure that the confidence interval will begin at the $r$th place counting from
the bottom, that is, $1, 2, 3, \ldots, r$, and end at the place reached by counting $r$ observations
back from the top. For example, if the 95 percent confidence calculation for $n = 10$ yields
$r = 3$, the confidence interval begins at the third observation and ends at the eighth
observation, both points reached by counting three places from bottom and top, respectively,
that is, 1, 2, 3 ($Y_3$) and 10, 9, 8 ($Y_8$), and the result would appear as $P[Y_3 < x_{0.5} <
Y_8] = 0.95$.

In the binomial sum in Equation 7.5-7 we note that its mean is $n/2$ and its standard
deviation is $\sqrt{n}/2$. Hence the Normal approximation to the binomial sum in Equation 7.5-7
for a 95 percent confidence interval is

$$\sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n \approx \frac{1}{\sqrt{2\pi}}\int_{\alpha'}^{\beta'}\exp\left[-\frac{1}{2}x^2\right]dx = 0.95,$$

which, from the tables of the standard Normal distribution function $F_{SN}(x)$, yields $\alpha' =
-1.96$, $\beta' = 1.96$. Then it follows from Equation 7.5-6, with $\alpha = r$ and $\beta = n - r$, that

$$1.96 = \frac{n - r - n/2 + 0.5}{\sqrt{n}/2}, \qquad -1.96 = \frac{r - n/2 - 0.5}{\sqrt{n}/2},$$

which yields $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$. If $r$ is not an integer replace $r$ by $\lfloor r \rfloor$, which is
the greatest-integer (floor) function, that is, the largest integer less than or equal to $r$.


Example 7.5-6 
(95 percent confidence interval for the median for $n = 20$) We make 20 observations on an
RV $X$ and label these $\{X_i, i = 1, \ldots, 20\}$. We order them by size so that $Y_1 < Y_2 < \cdots < Y_{20}$.
We use $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$ to obtain $r = 6.12$ and $\lfloor r \rfloor = 6$. Then $P[Y_6 < x_{0.5} <
Y_{15}] \ge 0.95$.
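
The rank $r$ and the exact coverage of the resulting interval can be checked numerically. In this sketch the function names `median_ci_rank` and `exact_median_coverage` are illustrative assumptions; the first applies the Normal-approximation formula for $r$, the second evaluates the exact binomial sum of Equation 7.5-7.

```python
from math import comb, floor, sqrt

def median_ci_rank(n, z=1.96):
    """floor of r = n/2 - z*sqrt(n)/2 + 0.5 (Normal approximation, p = 1/2)."""
    return floor(n / 2 - z * sqrt(n) / 2 + 0.5)

def exact_median_coverage(n, r):
    """Exact P[Y_r < x_0.5 < Y_{n-r+1}] from the binomial sum in Equation 7.5-7."""
    return sum(comb(n, i) for i in range(r, n - r + 1)) / 2**n

n = 20
r = median_ci_rank(n)
coverage = exact_median_coverage(n, r)
print(r, n - r + 1, round(coverage, 4))   # lower rank, upper rank, exact coverage
```

Because $r$ is rounded down, the interval is slightly wider than the approximation requires, which is why the exact coverage comes out a little above 0.95.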


Distribution-Free Hypothesis Testing: Testing If Two Populations 
Are the Same Using Runs 


In general, hypothesis testing using nonparametric statistics is more involved than in the 
parametric case because of the difficulty of computing the distribution of the test statistic. 
However, when the size of the samples is large, say greater than 10, we can use the Normal 
approximation for computing the acceptance/rejection region. 

We introduce the idea of a run by considering the following simple situation. We make
$n_1$ observations on an RV $X$ (the “population”) with CDF $F_X(x)$ and label these samples
$\{X_i^{(1)}, i = 1, \ldots, n_1\}$. After ordering them by size we create the samples $\{Y_i, i = 1, \ldots, n_1\}$.
Then we make $n_2$ observations on the same RV $X$ and label these samples $\{X_i^{(2)}, i =
1, \ldots, n_2\}$. We order these samples by size to obtain the ordered set $\{Z_i, i = 1, \ldots, n_2\}$. Next
we combine the two unordered sets of samples into a single set and order them by size. Then
a typical ordered sequence might be $Z_1, Z_2, Y_1, Z_3, Y_2, \ldots, Z_{n_2}, Y_{n_1-1}, Y_{n_1}$, where $Z_1 < Z_2 <
Y_1 < Z_3 < Y_2 < \cdots < Z_{n_2} < Y_{n_1-1} < Y_{n_1}$. We define a run as a sequence of letters of the
same kind bounded by letters of the other kind or the beginning/end of the entire sequence.
Thus $Z_1, Z_2$ is the first run and its length is two. The next run is $Y_1$ and it has length one,
etc. The last run is $Y_{n_1-1}Y_{n_1}$ and it has length two. We count the total number of runs
and call this $D$. We note that $D$ is a random variable. Since the two sets of samples come
from the same population, we expect a thorough mixing of the $Y$’s and $Z$’s and therefore
a large $D$. Note, however, that had the $Y$’s and $Z$’s come from different populations, $D$
would, in all likelihood, be significantly reduced. For example, suppose that we have two
populations, say, $X^{(1)}$ with pdf $f_{X^{(1)}}(x) = \mathrm{rect}(x)$ and $X^{(2)}$ with pdf $f_{X^{(2)}}(x) = \mathrm{rect}(x-2)$.
If $\{Y_i, i = 1, \ldots, n\}$ represent the ordered sequence from the $X^{(1)}$ population and $\{Z_i, i =
1, \ldots, n\}$ represent the ordered sequence from the $X^{(2)}$ population, then the ordered samples
of the mixed sequence will appear as $Y_1Y_2\cdots Y_nZ_1Z_2\cdots Z_n$ and will have $D' = 2$ since the
supports of their pdf’s don’t overlap.


Example 7.5-7 
(realizations of $D$ for populations of equal and different means) We generate two sets of
ten Normal random numbers (we show only to two places) from $N(0,1)$ obtained from
RANDOM.ORG, a Normal random number provider available on the Internet.


$N(0,1) \to \{x^{(1)}$: −0.19, 0.99, −1.1, −1.0, −1.3, −0.53, −0.25, 0.75, −0.25, 0.75$\}$
$N(0,1) \to \{x^{(2)}$: 0.68, −1.2, 0.28, 0.61, −1.2, −1.5, 2.1, −0.10, −0.87, 0.80$\}$.


We order by size the $x^{(1)}$ and $x^{(2)}$ sequences separately to create, respectively, the ordered
sequences $y_1y_2\cdots y_{10}$ and $z_1z_2\cdots z_{10}$, where $y_1 = -1.3$, $y_{10} = 0.99$, $z_1 = -1.5$, and $z_{10} =
2.1$. After combining the two sequences into a single sequence and ordering all the elements
of this sequence by size, we get the sequence

$$z_1\,y_1\,z_2z_3\,y_2y_3\,z_4\,y_4y_5y_6y_7\,z_5z_6z_7z_8\,y_8y_9\,z_9\,y_{10}\,z_{10},$$

which yields $D' = 11$.

We now repeat the experiment and select ten random numbers from the standard
Normal distribution, that is, $N(0,1)$, and another ten from $N(1,1)$; the numbers are displayed
to two places. The result is


$N(0,1) \to \{x^{(1)}$: −0.079, 1.3, 0.15, 1.2, 0.75, −1.2, −0.11, −0.84, 0.35, 0.55$\}$
$N(1,1) \to \{x^{(2)}$: 1.2, 0.056, 0.3, −0.77, 0.95, 1.1, 0.095, −0.43, 1.1, 1.3$\}$.


Here the ordered $y$ sequence is associated with the $N(0,1)$ and the ordered $z$ sequence
is associated with $N(1,1)$. After combining the two sequences into a single sequence and
ordering all the elements of this single sequence by size, we get the sequence

$$y_1y_2\,z_1z_2\,y_3y_4y_5\,z_3z_4z_5\,y_6y_7y_8\,z_6z_7z_8z_9\,y_9y_{10}\,z_{10},$$

which yields $D' = 8$, that is, 27 percent fewer runs than in the $N(0,1)$ case. This example
suggests that the RV $D$ can be used as a statistic for testing the hypothesis that the populations
are the same. If $D$ is large enough, say $D > d_0$, we may conclude that the two samples come
from the same population; else we reject that they come from the same population. The
choice of $d_0$ is discussed below.
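
Counting runs by hand is error-prone, so a short routine is useful. The sketch below merges two samples, sorts them, and counts the label changes; the function name `count_runs` is our own, and the data are the two $N(0,1)$ samples of the first experiment above. Ties between the two samples are broken arbitrarily (the "y" label sorts first), which is an assumption of this sketch and does not arise with these data.

```python
def count_runs(sample_y, sample_z):
    """Number of runs D' in the size-ordered merge of two samples."""
    labeled = sorted([(v, "y") for v in sample_y] + [(v, "z") for v in sample_z])
    labels = [label for _, label in labeled]
    if not labels:
        return 0
    # A new run starts at every position where the label changes.
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)

x1 = [-0.19, 0.99, -1.1, -1.0, -1.3, -0.53, -0.25, 0.75, -0.25, 0.75]
x2 = [0.68, -1.2, 0.28, 0.61, -1.2, -1.5, 2.1, -0.10, -0.87, 0.80]
runs = count_runs(x1, x2)
print(runs)
```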


We test whether two samples come from the same population using the principles of hypothesis
testing. We have two sets of samples: $\{X_i^{(1)}, i = 1, \ldots, n_1\}$ and $\{X_i^{(2)}, i = 1, \ldots, n_2\}$.
The null hypothesis, $H_1$, is that the two samples come from the same population, while
the alternative, $H_2$, is that they do not come from the same population or, perhaps more
accurately, that there is not enough evidence that they come from the same population.
The test will be based on observing the test statistic $D$. If the event $\{D > d_0\}$ occurs,
then the two samples interweave well and we may conclude that they come from the same
population. If the event $\{D < d_0\}$ occurs, we may conclude that $H_1$ is not supported by the
data. If $\alpha \triangleq P[\text{rejecting } H_1 \mid H_1 \text{ true}]$ denotes the level of significance, then

$$\alpha = P[D < d_0 \mid H_1 \text{ true}] = \sum_{\text{all } d < d_0} P_D(d; n_1, n_2),$$

where $P_D(d; n_1, n_2)$ is the probability of observing $d$ runs in interwoven sequences of
lengths $n_1$ and $n_2$.
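Computing $P_D(d; n_1, n_2)$ requires some rather sophisticated counting procedures so we
give only the final result here. Define

$$C_k^n \triangleq \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Under the null hypothesis we find that

$$P_D(d; n_1, n_2) = \begin{cases}
2\,C_{d/2-1}^{n_1-1}\,C_{d/2-1}^{n_2-1}\Big/ C_{n_1}^{n_1+n_2}, & d \text{ even},\\[6pt]
\left(C_{(d-1)/2}^{n_1-1}\,C_{(d-3)/2}^{n_2-1} + C_{(d-3)/2}^{n_1-1}\,C_{(d-1)/2}^{n_2-1}\right)\Big/ C_{n_1}^{n_1+n_2}, & d \text{ odd}.
\end{cases}$$

These unwieldy formulas do not yield much for the purpose of analysis and require machine
computation to evaluate $\alpha$. However, it has been shown that for $n_1 > 10$, $n_2 > 10$, the
distribution of $D$ is well approximated by a Normal CDF with approximate mean and
variance given by, respectively,

$$\mu_D \approx \frac{2n_1n_2}{n_1+n_2}, \qquad
\sigma_D^2 \approx 4(n_1+n_2)\left(\frac{n_1}{n_1+n_2}\right)^2\left(\frac{n_2}{n_1+n_2}\right)^2.$$

Hence we approximate $\alpha = P[D < d_0 \mid H_1 \text{ true}] = \sum_{\text{all } d<d_0} P_D(d; n_1, n_2)$ with

$$\alpha = \sum_{\text{all } d<d_0} P_D(d) \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_\alpha}\exp\left(-\frac{1}{2}z^2\right)dz, \qquad
z_\alpha \triangleq \frac{d_0 - \mu_D}{\sigma_D}.$$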
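
Both the exact run probabilities and the Normal-approximation threshold are straightforward to evaluate by machine. The following is a minimal sketch under the formulas stated above; the function names `runs_pmf` and `runs_threshold` are hypothetical, and `math.comb` plays the role of the binomial coefficient.

```python
from math import comb, sqrt
from statistics import NormalDist

def runs_pmf(d, n1, n2):
    """Exact P_D(d; n1, n2) under the null hypothesis, for d >= 2."""
    total = comb(n1 + n2, n1)
    if d % 2 == 0:
        k = d // 2
        return 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1) / total
    k = (d - 1) // 2
    return (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
            + comb(n1 - 1, k - 1) * comb(n2 - 1, k)) / total

def runs_threshold(n1, n2, alpha):
    """Normal-approximation threshold d0 = mu_D + z_alpha * sigma_D."""
    n = n1 + n2
    mu_d = 2 * n1 * n2 / n
    sigma_d = sqrt(4 * n * (n1 / n) ** 2 * (n2 / n) ** 2)
    return mu_d + NormalDist().inv_cdf(alpha) * sigma_d

n1 = n2 = 10
total_prob = sum(runs_pmf(d, n1, n2) for d in range(2, n1 + n2 + 1))
d0 = runs_threshold(n1, n2, 0.05)
print(round(total_prob, 12), round(d0, 1))
```

Summing the exact probabilities over all $d < d_0$ gives the exact significance level, which can be compared with the nominal $\alpha$ to judge the quality of the Normal approximation for given $n_1$, $n_2$.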


Example 7.5-8 
(run test on sameness of two populations) We request two sets of ten random numbers from
RANDOM.ORG from a population $N(1,1)$ and order these by size as


$N(1,1) \to \{y^{(1)}$: −1.4, −0.33, 0.40, 0.44, 0.70, 0.74, 1.3, 1.3, 1.7, 2.4$\}$
$N(1,1) \to \{y^{(2)}$: −0.67, −0.21, 0.38, 0.38, 0.51, 0.71, 1.4, 1.5, 2.0, 2.9$\}$.


For calibration we co-join these two sequences into a single sequence and order the elements
of the sequence by size. We find that the calibration realization is $D' = 12$. We then request
a set of random numbers from an “unknown” Normal distribution and order these by size as


$\{y^{(3)}$: −3.8, −2.5, −0.13, 2.2, 2.8, 3.0, 3.8, 4.6, 5.5, 5.8$\}$.


After interleaving these by size with the $\{y^{(1)}\}$ sequence and counting the runs, we get
$D' = 6$. We wish to test the hypothesis that the $\{y^{(1)}\}$ and $\{y^{(3)}\}$ sequences come from the
same population at the 0.05 level of significance. We solve

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_{0.05}}\exp\left(-\frac{1}{2}x^2\right)dx = 0.05$$

and find that $z_{0.05} = -1.65$. For the given sample sizes we find that $\mu_D = 10$, $\sigma_D = \sqrt{5}$.
Thus, $d_0 = \sigma_D z_{0.05} + \mu_D = 6.3$ and since $D' < d_0$ (barely), we reject the hypothesis that
$\{y^{(3)}\}$ comes from a $N(1,1)$ population. Indeed, in this case, the $\{y^{(3)}\}$ sequence comes
from a $N(1,3)$ population.


Ranking Test for Sameness of Two Populations 


Another procedure for testing the sameness of two populations is the so-called ranking test. 
Assume that we have two continuous populations $X$ and $Y$ with respective distribution
functions $F_X(x)$ and $F_Y(y)$. We wish to test the hypothesis $H_1: F_X = F_Y$ versus the
alternative $H_2: F_X \ne F_Y$. We take $n_1$ samples from $X$ and $n_2$ from $Y$, co-join them, and order
them by size. Then we assign to each element of the sequence a number denoting its place
in the ascending order; for example, the event $X_1 < Y_1 < X_2 < X_3 < Y_2 < Y_3 < Y_4$ would
be designated as

    X_1   Y_1   X_2   X_3   Y_2   Y_3   Y_4
     1     2     3     4     5     6     7

The number associated with each element is its rank, and the $Y$ sequence has ranks 2, 5,
6, and 7. Here $n_1 = 3$, $n_2 = 4$. The rank of the last element in the sequence is $n_1 + n_2$ and
the rank of the first is 1. It is shown elsewhere that the RV

$$T \triangleq \sum_{Y \text{ sequence}} \text{ranks}$$

is a suitable test statistic to test the hypothesis that $F_X(x) = F_Y(x)$, for all $x$. If $T$ is too
large or too small, the hypothesis is rejected. To test the hypothesis at a level of significance
$\alpha$, we need the distribution of $T$ under the null hypothesis. It is shown elsewhere ([7-22] to
[7-24]) that when $n_1 > 7$, $n_2 > 7$ (ideally we would want them larger), $T$ is approximately
distributed as $N(\mu_T, \sigma_T^2)$ with $\mu_T = n_2(n_1+n_2+1)/2$, $\sigma_T^2 = n_1n_2(n_1+n_2+1)/12$. In the
example above we find $\mu_T = 16$, $\sigma_T^2 = 8$.


Example 7.5-9 
(ranking test on sameness of two populations) We use the $\{y^{(1)}\}$ and $\{y^{(3)}\}$ sequences of
Example 7.5-8, co-join them, and assign ranks to the elements of the ascending sequence.
For the elements of the $\{y^{(3)}\}$ sequence, the ranks are 1, 2, 5, 13, 15, 16, 17, 18, 19, and 20;
their sum is 126, $\mu_T = 105$, and $\sigma_T = 13.23$. The hypothesis is that the two sequences come
from the same population. At a level of significance $\alpha = 0.05$, we solve $F_{SN}(z_{0.025}) = 0.025$
and get $z_{0.025} = -1.96$ so that the critical region is $\{T > 131\} \cup \{T < 79\}$. So we accept
the hypothesis (in error) that the two sequences come from the same population. At a
significance level $\alpha = 0.1$ the critical region is $\{T > 127\} \cup \{T < 83\}$. Marginally above
$\alpha = 0.1$, the hypothesis is rejected.
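
The rank sum and its approximate Normal parameters can be reproduced with a short routine. This is a sketch only: the function name `rank_sum_statistic` is ours, ties between the two samples are not given mid-ranks (none occur in these data), and the data are the $\{y^{(1)}\}$ and $\{y^{(3)}\}$ sequences of Example 7.5-8.

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_statistic(sample_x, sample_y):
    """Return (T, mu_T, sigma_T), where T is the sum of the ranks of sample_y."""
    merged = sorted([(v, 0) for v in sample_x] + [(v, 1) for v in sample_y])
    t = sum(rank for rank, (_, is_y) in enumerate(merged, start=1) if is_y)
    n1, n2 = len(sample_x), len(sample_y)
    mu_t = n2 * (n1 + n2 + 1) / 2
    sigma_t = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return t, mu_t, sigma_t

y1 = [-1.4, -0.33, 0.40, 0.44, 0.70, 0.74, 1.3, 1.3, 1.7, 2.4]
y3 = [-3.8, -2.5, -0.13, 2.2, 2.8, 3.0, 3.8, 4.6, 5.5, 5.8]
t, mu_t, sigma_t = rank_sum_statistic(y1, y3)

z = NormalDist().inv_cdf(1 - 0.05 / 2)       # two-sided test, alpha = 0.05
lower, upper = mu_t - z * sigma_t, mu_t + z * sigma_t
print(t, mu_t, round(sigma_t, 2), round(lower, 1), round(upper, 1))
print("reject" if (t < lower or t > upper) else "accept")
```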


SUMMARY 


Hypothesis testing is a major branch of statistics that deals with decision making in a 
random (i.e., probabilistic) environment. In the beginning of this chapter we put ourselves 
in the mind of a surgeon who faced a difficult decision regarding whether to operate on one 
of his patients. By using all available prior information and seeking to minimize the average 
risk, we derived the Bayes decision rule, which—arguably—is the most rational approach to 
making decisions when available information is of the probabilistic kind rather than being 
categorical. The Bayes decision rule leads to a likelihood ratio test (LRT).

The prior probabilities (sometimes called a priori probabilities) required in Bayes testing 
may not always be available in which case the threshold in the LRT for accepting/rejecting 
the hypothesis is determined not by minimizing the average risk but by the specified error 
probability a, which is the probability of rejecting the hypothesis based on observational 
data when in fact the hypothesis is true. In the case of testing a simple hypothesis versus a 
simple alternative, the Neyman—Pearson Theorem ensures that the LRT is optimum in that 
it is the most powerful test. By this is meant that the probability of rejecting the alternative 
hypothesis when it is true is driven to a minimum. 

In a number of situations, testing a simple hypothesis versus a simple alternative won’t 
do because the hypothesis or the alternative or both involve many outcomes in the under- 
lying sample space. In that case the generalized likelihood ratio test (GLRT) is useful. We 
illustrated the GLRT with a number of examples and, in doing so, encountered such classic 
statistical tests as the F-test, the t-test, and the Pearson Chi-square test. 

We then considered ordering, percentiles, and rank and illustrated how these tools can 
be made useful in distribution-free (sometimes called robust) statistics. We illustrated these 
with hypothesis testing examples using run tests and ranking tests. 


PROBLEMS 


7.1 Prove Equation 7.1-6. 

7.2 Consider Example 7.1-1. Let the prior probabilities be $P_1 = 0.9$, $P_2 = 0.1$. How does
this affect the Bayes decision rule?

7.3 Assume a Normal population $X: N(\mu, 1)$ and a sequence of i.i.d. observations on $X$,
that is, $\{X_i : i = 1, \ldots, n\}$. Find the critical region for testing the hypothesis
$H_1: \mu = \mu_1$ versus the alternative $H_2: \mu > \mu_1$ at the 0.05 level.

7.4 Show that the power $P$ of an LRT is given by $P = P[\text{reject } H_1 \mid H_2 \text{ is true}]$.

7.5 Why was it not necessary to invoke the Central Limit Theorem to argue that $\hat{\mu}(n)$
in Example 7.2-2 is Normally distributed?

7.6 We flip a coin 100 times and observe $50 + k$ heads and $50 - k$ tails. What is the
largest value of $k$ that will enable us to accept the hypothesis that the coin is fair at
$\alpha = 0.05$ significance? Repeat for $\alpha = 0.01$.

7.7 A customer in a sub-freezing environment is considering buying an automobile battery
at DBW (“Discount Battery Warehouse”). The particular battery model of interest 
is imported from one of two possible sources, say A and B, which do not share the 
same quality-control standards. The better import (A) will start the car 90 percent 
of the time in sub-freezing weather while the worse import (B) will start the car only 
50 percent of the time in such weather. There are equal numbers of batteries from
each source. The imports cannot be differentiated by any external visible features.
The battery salesman will allow the customer only one try at starting his car with a
test battery, under sub-freezing conditions, before purchase.

We shall treat the customer’s dilemma, such as it is, from a hypothesis testing point
of view. Let the hypothesis be $H_1$: the battery-start probability $p_1 = 0.9$ versus the
alternative $H_2$: the battery-start probability $p_2 = 0.5$. There are two actions: $a_1$ (buy
the battery) and $a_2$ (reject the battery). The loss functions are in dollars: $l(a_1, p_1) =
0$; $l(a_1, p_2) = 40$ (money spent on a poor battery); $l(a_2, p_1) = 10$ (passing up a good
deal that would cost at least 10 dollars elsewhere); $l(a_2, p_2) = 0$. Define the RV $X$ as

$$X \triangleq \begin{cases} 1, & \text{if battery starts the car in test trial},\\ 0, & \text{if battery fails to start car in test trial}. \end{cases}$$

(a) Define the four possible decision functions $(d_i, i = 1, \ldots, 4)$;

(b) Compute the risk for each decision function $(R(d_i; p_j), i = 1, \ldots, 4; j = 1, 2)$;

(c) Plot the risk function points in a Cartesian system where the abscissa is
$R(d; p_1)$ and the ordinate is $R(d; p_2)$. From the graph, determine which decision
function is dominated (is worse) by at least one other decision function
and therefore is inadmissible (not worthy of consideration).

(d) Suppose it is known that there are twice as many batteries from import B as
from A; how would this affect your decision?


7.8 For a particular problem involving a simple hypothesis versus a simple alternative, it is
found by considering all strategies that the set of risk points $[R(d; \theta_1), R(d; \theta_2)]$ associated
with admissible strategies is approximated by $(R(d; \theta_1) - 1)^2 + (R(d; \theta_2) - 1)^2 =
1$ for $0 \le R(d; \theta_1) \le 1$, $0 \le R(d; \theta_2) \le 1$. It is known that $P[\theta_1] = P[\theta_2] = 0.5$. What
is the Bayes strategy in this case?

7.9 Let $X: N(\mu, 1)$, where $\mu = \mu_1 = 1/2$ or $\mu = \mu_2 = -1/2$. Let $H_1: \mu = -1/2$ and
$H_2: \mu = 1/2$. Define the two actions $a_1$: accept $H_1$ (reject $H_2$) and $a_2$: accept $H_2$
(reject $H_1$). The sample space for $X$ is $\Omega = (-\infty, \infty)$. Let $S_1 = (-\infty, 0)$ and
$S_2 = (0, \infty)$. Consider the two mutually exclusive events $E_1 = \{X \in S_1\}$ and
$E_2 = \{X \in S_2\}$.
(a) Compute the four probabilities $P(E_i \mid \mu_j)$, $i = 1, 2$; $j = 1, 2$;
(b) Define the four possible decision functions $d_i, i = 1, \ldots, 4$;
(c) Assuming the loss functions $l(a_1, \mu_1) = 0$, $l(a_1, \mu_2) = 2$, $l(a_2, \mu_1) = 5$, $l(a_2, \mu_2)
= 0$, compute the risks associated with each of the decision functions in (b).
Which decision function is inadmissible, that is, there is at least one other
decision function that dominates (is better than) it?

7.10 We have two Normal populations $X_1: N(\mu_1, \sigma^2)$ and $X_2: N(\mu_2, \sigma^2)$. We test $H_1: \mu_1 =
\mu_2$ versus $H_2: \mu_1 \ne \mu_2$ at a level of significance of 5 percent. Describe the test.

7.11 We have two Normal populations $X_1: N(\mu_1, \sigma^2)$ and $X_2: N(\mu_2, \sigma^2)$. We test $H_1:
\mu_1 = \mu_2$ versus $H_2: \mu_1 > \mu_2$ at a level of significance of 5 percent. Describe the test.

7.12 Repeat Problem 7.11 with the change that $H_1: \mu_1 = \mu_2$ versus $H_2: \mu_1 < \mu_2$.

7.13 Assume that we have a Normal random variable upon which we make $n$ i.i.d. observations
$X_1, \ldots, X_n$. Based on the sample size $n$ we wish to test $H_1: \sigma_X^2 = \sigma_0^2$ versus $H_2:
\sigma_X^2 \ne \sigma_0^2$. Describe the test and show that a simplified test can be done using the $\chi^2$
distribution.

7.14 Let $X: N(\mu, \sigma^2)$, where it is known that $\mu = 1$ or 0. We test the hypothesis $H_1: \mu = 1$
versus $H_2: \mu = 0$ based on only one observation.

(a) Show that the likelihood ratio test is of the form accept $H_1: \mu = 1$ if
$\Lambda = \exp\left(\frac{1}{2\sigma^2}(2X - 1)\right) > k$;

(b) Show that an equivalent test is accept $H_1: \mu = 1$ if $X > c \triangleq \sigma^2\ln k + 0.5$;

(c) For $\alpha = 0.02$, where $\alpha = P[\text{reject } H_1 \mid H_1 \text{ true}]$, show that the constant $c$ that
separates the critical from the acceptance region is given by $c = z_{0.02}\sigma + 1$.
Here $z_{0.02}$ is the second percentile of the standard Normal CDF.

(d) Let $\sigma = 1$; show that $c = -2.05 + 1 = -1.05$.

7.15 Let $X: N(\mu, 1)$ represent a population whose mean is known to be $\mu = \mu_1 = 3$ or
$\mu = \mu_2 = 1$. We make $n$ i.i.d. observations on $X$ and call these $\{X_i, i = 1, \ldots, n\}$.
Let $H_1: \mu = \mu_1 = 3$ and $H_2: \mu = \mu_2 = 1$; show that the LRT is reduced to accept
$H_1$ if $\hat{\mu} > (2n)^{-1}\ln(k) + 2 \triangleq c_n$, where, as usual, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The constant
$c_n$ is determined by the significance $\alpha$. Find a general expression for $c_n$ in terms of
$\mu_1$, $n$, and $z_\alpha$, the latter being the $\alpha$ percentile of the $N(0,1)$ distribution. Assuming
$n = 10$, what is the value of $c_n$ for $\alpha = 0.01$?

7.16 (continuation of Problem 7.15) In Problem 7.15 treat $n$ as an unknown and calculate
the value of $n$ needed to obtain $\alpha = 0.02$ and $\beta = 0.01$ simultaneously.

7.17 (continuation of Problem 7.16) Keeping $\alpha$ at $\alpha = 0.02$ show that the number of
samples needed to achieve a given power follows the graph below. (Hint: Use
NORMINV(probability, mean, standard deviation) in Excel™.)


[Figure: number of samples needed (vertical axis) versus power $= 1 - \beta$ (horizontal axis).]

7.18 (F-test for comparing variances) The F-test is useful in testing whether the variances
(or standard deviations) of two Normal populations are the same. Typically we test
the hypothesis $H_1: \sigma_1 = \sigma_2$ versus $H_2: \sigma_1 \ne \sigma_2$. The F-test can be done online by
entering the data from two Normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ and taking
the ratio of the sample variances. Thus, assume we have $m$ samples from population
P1 $\{X_{1i}, i = 1, \ldots, m\}$ and $n$ samples from population P2 $\{X_{2j}, j = 1, \ldots, n\}$.
We do not mix the samples because it is important to keep the sample variances
independent of each other. One of several programs will compute from the input
realizations $\{x_{1i}, i = 1, \ldots, m\}$ and $\{x_{2j}, j = 1, \ldots, n\}$ the numerical sample variances,
often denoted by the symbols $s_1^2$ and $s_2^2$, as $s_1^2 \triangleq (m-1)^{-1}\sum_{i=1}^{m}(x_{1i} - \bar{x}_1)^2$ (degrees
of freedom DOF $= m-1$) and $s_2^2 \triangleq (n-1)^{-1}\sum_{j=1}^{n}(x_{2j} - \bar{x}_2)^2$ (degrees of freedom
DOF $= n-1$). In these expressions $\bar{x}_1 = m^{-1}\sum_{i=1}^{m} x_{1i}$ and $\bar{x}_2 = n^{-1}\sum_{j=1}^{n} x_{2j}$
are the sample numerical means. We need to specify the significance level $\alpha$. The
algorithm then proceeds as follows: (1) compute $F' = s_1^2/s_2^2$; (2) compare $F'$ with
$F_{\alpha/2,\nu_1,\nu_2}$, where $F_{\alpha/2,\nu_1,\nu_2}$ is the critical value of the F-distribution with
$\nu_1 = m-1$ and $\nu_2 = n-1$ degrees of freedom and significance $\alpha$.
When testing $H_1: \sigma_1 = \sigma_2$ versus $H_2: \sigma_1 > \sigma_2$ reject $H_1$ if $F' > F_{\alpha,\nu_1,\nu_2}$.
When testing $H_1: \sigma_1 = \sigma_2$ versus $H_2: \sigma_1 < \sigma_2$ reject $H_1$ if $F' < F_{1-\alpha,\nu_1,\nu_2}$.
When testing $H_1: \sigma_1 = \sigma_2$ versus $H_2: \sigma_1 \ne \sigma_2$ reject $H_1$ if $F' < F_{1-\alpha/2,\nu_1,\nu_2}$ or
$F' > F_{\alpha/2,\nu_1,\nu_2}$.
As an exercise, generate two sets of Gaussian random numbers first with the same
$\sigma$ and then with different $\sigma$’s and test the efficacy of the F-test using an online
calculator, for example, the BioKin statistical calculator.
7.19 (generalizing the F-test for multiple groups) Another way to use the F-test is to test
whether different groups are statistically alike. We shall develop the statistical background
for this test in this and the next problem. We assume that there are $k$
groups with $n_i$ samples in each group and $\sum_{i=1}^{k} n_i = n$. Let the $j$th sample in
group $i$ be denoted by $Y_{ij}$ and let the group sample mean be defined as $Z_i \triangleq
n_i^{-1}\sum_{j=1}^{n_i} Y_{ij}$, $i = 1, \ldots, k$. The $\{Y_{ij}, j = 1, \ldots, n_i; i = 1, \ldots, k\}$ are $n$ independent
random variables. Within each group, the $n_i$ random variables are i.i.d. so that
$\mathrm{Var}[Y_{ij}] = \sigma_{Y_i}^2$ for all $j$ samples. By defining $Z_i$ the way we did we have generated
$k$ Normal random variables $\{Z_i, i = 1, \ldots, k\}$ with variances denoted by $\sigma_{Z_i}^2, i =
1, \ldots, k$ and whose overall sample mean is $\hat{\mu}_Z = k^{-1}\sum_{i=1}^{k} Z_i$. Show the circumstances
when $\sum_{i=1}^{k}\left(\frac{Z_i - \hat{\mu}_Z}{\sigma_{Z_i}}\right)^2$ is $\chi^2_{k-1}$. Let $\mathrm{Var}[Y_{ij}] = \sigma_Y^2$ for $i = 1, \ldots, k$; show that
$\sigma_{Z_i}^2 = \sigma_Y^2/n_i$ for $i = 1, \ldots, k$. Explain why $\sum_{i=1}^{k}(Z_i - \hat{\mu}_Z)^2$ is sometimes called
inter-group variability or between-group variability.


7.20 (continuation of Problem 7.19) Show under what circumstances
$\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(\frac{Y_{ij} - Z_i}{\sigma_{Y_i}}\right)^2$ is $\chi^2_{n-k}$. Under what hypothesis can
$\sum_{i=1}^{k}\left(\frac{Z_i - \hat{\mu}_Z}{\sigma_{Z_i}}\right)^2 \Big/ \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(\frac{Y_{ij} - Z_i}{\sigma_{Y_i}}\right)^2$
be written as $\sum_{i=1}^{k} n_i(Z_i - \hat{\mu}_Z)^2 \Big/ \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - Z_i)^2 = \chi^2_{k-1}/\chi^2_{n-k}$? Finally
create the F-statistic by dividing the numerator by $k-1$ and denominator by $n-k$, that is,

$$F_{k-1,n-k} = (n-k)\sum_{i=1}^{k} n_i(Z_i - \hat{\mu}_Z)^2 \Big/ (k-1)\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - Z_i)^2
= (n-k)\,\chi^2_{k-1}\big/(k-1)\,\chi^2_{n-k}.$$

7.21 (F-test) We are given the following factual data from [7-6] that tests the oxygen
assimilating capability of various levels of smokers versus nonsmokers. There are five
categories:

Category                                    Mean respiratory   Standard deviation   Number of people
                                            flow rate          of flow rate         in category
Nonsmokers in smoke-free environment (1)    3.17               0.74                 200
Nonsmokers in smoky environment (2)         (illegible)        0.71                 200
Light smokers (3)                           2.63               0.73                 200
Moderate smokers (4)                        2.29               0.70                 200
Heavy smokers (5)                           2.19               0.72                 200

The hypothesis $H_1$ is that there is no difference in air flow among the five categories;
the alternative is that there is at least one category whose respiratory statistics are
significantly different from the others.†

Compute whether to accept or reject the hypothesis at the 0.05 significance level.
(Chi-square test) Plant biologists attempt to test Mendel’s law of hereditary by 
crossing two pea plants. According to Mendel’s law three-fourth of the offspring 
should be green (dominant color) and one-fourth should be yellow (recessive). In 
880 plants, the biologists observe 639 green seeds and 241 yellow seeds. Let Hy: green 
allele’ is dominant and Hy: green allele is not dominant. Determine at the 0.05 level 
of significance whether to accept or reject the hypothesis. 

(t-test) You are given two sets of realizations and told that they come from Normal 
distributions with the same variance. Use a t-test to test Hy : fy = fly versus 


Ag: hy F Me. 
Set 1:
-5.980e-1  -9.290e-1  -8.340e-2  1.020e+0  6.780e-1  2.890e-1  1.430e-1  -2.060e+0  1.260e+0  1.670e+0

Set 2:
6.270e-1  2.640e+0  1.530e+0  5.920e-1  1.910e+0  5.050e-1  7.660e-1  2.760e-1  3.070e+0  8.550e-1


†Note that if we reject the hypothesis, we still won’t know which category (or categories) was responsible
for the rejection.
‡A gene transferring inherited characteristics.
7.24


7.25 


7.26 


7.27 


7.28 


7.29 


7.31 


7.32 
7.33 


7.34 


Show that the statistic $V$ in the Pearson goodness-of-fit test has expectation $E[V|H_1] = l-1$
under hypothesis $H_1$.

Show that the statistic $V$ in the Pearson goodness-of-fit test has expectation $E[V|H_2] > l-1$
under the alternative $H_2$.

Consider the F-test in testing for the equality of two variances. Plot the test statistic 
versus the variance ratio for m = 8,n = 5. Find the critical region for significance of 
0.05. 

In testing the equality of two variances of two Normal populations with $m$ samples
from population $P_1$ and $n$ samples from population $P_2$, show that when $H_1$ is true
$\Lambda$ can be written as

\[ \Lambda = A(m,n)\,\frac{\left(\dfrac{m-1}{n-1}\,F_{m-1,n-1}\right)^{m/2}}{\left(1+\dfrac{m-1}{n-1}\,F_{m-1,n-1}\right)^{(m+n)/2}}, \]

where $A(m,n) \triangleq (m+n)^{(m+n)/2}\, m^{-m/2}\, n^{-n/2}$.

Assume that we have a Normal random variable upon which we make $n$ i.i.d. observations
$X_1,\dots,X_n$. Based on the sample size $n$ we wish to test $H_1: \sigma_X^2 = \sigma_0^2$ versus
$H_2: \sigma_X^2 \ne \sigma_0^2$. Describe the test and show that a simplified test can be done using the $\chi^2$
distribution.

Despite claims by an older sibling that the coin being used in a monetary betting 
game is “fair,” the younger sibling has his suspicions based on numerous losses. The 
younger sibling believes that the coin is biased as P[head] = 0.55. In the next 50 
tosses, 35 heads appear as opposed to 15 tails. Test the hypothesis that P[head] = 
0.55 at the 0.05 level of significance. 

Twenty-four observations are made on a random variable $X$ and are ordered by size
as $Y_1 < Y_2 < \cdots < Y_{24}$. Estimate the 30th percentile.

Find a 98 percent confidence interval for the median from 25 samples. 

Perform a run test at the $\alpha = 0.05$ level of significance on the following sequences:

From $P_1$: $S_1 = \{-0.32, 1.05, 0.77, 0.23, -0.66, -2.03, -0.82, 1.97, -0.32, 1.12\}$
From $P_2$: $S_2 = \{3.2, -10.5, -7.7, -2.3, 6.6, 20.3, 8.2, -19.7, 3.2, -11.2\}$

Do $S_1$ and $S_2$ come from the same population?

Consider $H_1$: population $P_1$ is the same as population $P_2$ versus $H_2$: the populations
are not the same. Perform a run test at the $\alpha = 0.05$ level of significance on the
following sequences:

From $P_1$: $S_1 = \{-2, -6, 8, 4, 2, -4, 6, 2, -8, -2\}$
From $P_2$: $S_2 = \{3, 7, 1, -9, -3, 5, -7, 3, 9, -1\}$

Based on the run test do $S_1$ and $S_2$ come from the same population? From your
result can you suggest why the run test is not appropriate in this case?
8 Random Sequences


Random sequences are used as models of sampled data arising in signal and image processing, 
digital control, and communications. They also arise as inherently discrete data such as 
economic variables, the content of a register in a digital computer, something as simple as 
coin flipping (Bernoulli trials), or the number of packets on a link in a computer network. 
In each case, the random sequence models the unpredictable behavior of these sources from 
the user’s perspective. In this chapter we will study the random sequence and some of its 
important properties. As we will see, a random (stochastic) sequence can be thought of 
as an infinite dimensional vector of random variables.† As such it stands between finite
dimensional random vectors (cf. Chapter 5) and continuous-time random functions, called 
random processes, to be studied in the next chapter. 

Another way to generalize the random vector is by doubling the number of index para- 
meters to two, thereby creating random matrices, which have been found useful as mathe- 
matical models in image processing. When these random matrices grow in size, in the infinite 
limit we have a two-dimensional random sequence, used in many theoretical studies in image 
and geophysical signal processing. While we will not study image processing here, many of 
the basic concepts of random sequences carry over to the two-dimensional case. Three- and 
four-dimensional random sequences have been found useful models of unpredictable aspects 
in video and other spatiotemporal signals. 


†In the real world all sequences are finite. However, as long as the real-world sequences are long compared
to internal correlations, the infinite length model does not significantly detract from accuracy except when 
we are at the very beginning or end of the real-world sequence. 
8.1 BASIC CONCEPTS


In the course of developing this material we will have need to review and extend some of 
the basic material presented in Chapter 1 on the axioms of probability. This is because we 
must now routinely deal with an infinite number of random variables at one time, that is, a 
random sequence. We start out this study by offering a definition of the random sequence 
followed by a few simple examples. 


Definition 8.1-1 Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\zeta \in \Omega$. Let $X[n,\zeta]$ be a
mapping of the sample space $\Omega$ into a space of complex-valued sequences on some index set
$Z$. If, for each fixed integer $n \in Z$, $X[n,\zeta]$ is a random variable, then $X[n,\zeta]$ is a random
(stochastic) sequence. The index set $Z$ is all the integers, $-\infty < n < +\infty$, padded with
zeros if necessary.


See Figure 8.1-1 for an illustration for sample space $\Omega = \{1,\dots,10\}$. We see that
$X[n,\zeta]$ for a fixed outcome $\zeta$ is an ordinary sequence of numbers, that is, a deterministic
(nonrandom) function of the discrete parameter $n$. We often refer to these ordinary
sequences as realizations of the random sequence, or as sample sequences and denote them by
$X_\zeta[n]$ or merely by $x[n]$ when there is no confusion. Thus, ten sample sequences are plotted
in Figure 8.1-1, one for each outcome $\zeta \in \Omega$. On the other hand, for $n$ fixed and $\zeta$ variable,
$X[n,\zeta]$ is a random variable.† Thus the collection of all these realizations, $-\infty < n < +\infty$,
along with the probability space, is the random sequence. We shall often, but not always,


Figure 8.1-1 Illustration of the concept of random sequence $X[n,\zeta]$, where the $\zeta$ domain (i.e., the sample space $\Omega$) consists of just ten values. (Samples connected only for plot.)


†Elementary probability texts talk about an i.i.d. sequence of RVs denoted by $X_n(\zeta)$. Our random
sequence, however, allows the added complication of dependence among these RVs.
denote the random sequence by just $X[n]$. We retain the notation $X[n,\zeta]$ when its use
helps to clarify a point on the outcomes $\zeta$ of the underlying sample space $\Omega$. Note that we
use square brackets around the time argument $n$ here, as is the convention in discrete-time
signal processing.

We give the following simple examples of random sequences: 


Example 8.1-1 
(separable random sequence) Let $X[n,\zeta] \triangleq X(\zeta)f[n]$, where $X(\zeta)$ is a random variable and
$f[n]$ is a given deterministic (ordinary) sequence. Such a random sequence is the separable
product of a random variable (function) and an ordinary sequence. We will also write
$X[n] = Xf[n]$, suppressing the outcome variable $\zeta$, as is the custom for random variables.
We see that all the sample sequences are just scaled versions of one another, with the scalar
being the random variable $X$.
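
As a rough illustration of this example, the MATLAB sketch below draws a few outcomes and builds the corresponding sample sequences. The choices $X \sim N(0,1)$ and $f[n] = 0.9^n$ are assumptions made only for the sketch; the text leaves both unspecified.

```matlab
% Separable random sequence X[n] = X*f[n].
% Illustrative assumptions (not from the text): X ~ N(0,1), f[n] = 0.9^n.
n = (0:19)';               % time indices
f = 0.9 .^ n;              % deterministic sequence f[n]
X = randn(1, 3);           % three outcomes, i.e., three draws of X
x = f * X;                 % 20-by-3; column k is the sample sequence X(k)*f[n]

% Any two sample sequences are proportional: x(:,2) = (X(2)/X(1))*x(:,1).
ratio  = X(2) / X(1);
maxDev = max(abs(x(:,2) - ratio * x(:,1)));
```

Every column of `x` is a multiple of `f`, so `maxDev` should be zero up to rounding error.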


Example 8.1-2 
(sinusoid with random amplitude and phase) Let $X[n,\zeta] \triangleq A(\zeta)\sin(\pi n/10 + \Theta(\zeta))$, where
$A$ and $\Theta$ are random variables defined on a common probability space $(\Omega, \mathcal{F}, P)$, alternately
written $X[n] = A\sin(\pi n/10 + \Theta)$.


These two simple random sequences are made from deterministic components, but they
are also “deterministic” in another way. They have the unusual property, from a probabilistic
standpoint, that their future values are exactly determined from their present and
past values. In Example 8.1-1, once we observe $X[n]$ at any fixed value of $n$, say $n = 0$ (with $f[0] \ne 0$), then,
since the ordinary sequence $f[n]$ is assumed to be known and nonrandom, all of the random
sequence $X[n]$ becomes known. We see that the random sequence $X[n]$ is conditionally
known given its value at $n = 0$. The situation in Example 8.1-2 is just slightly more complicated,
but the same approach suffices to show that given two (nondegenerate) observations,
say at $n = 0$ and $n = 5$, one can determine the values taken on by the random variables $A$
and $\Theta$; then the sequence $X[n]$ becomes conditionally known or perfectly predictable given
these observations at $n = 0$ and $n = 5$. These deterministic random sequences would not be
good models for noise on a communications channel because real noise is not so easily foiled.
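
The claim about Example 8.1-2 can be checked numerically. Since $x[0] = A\sin\Theta$ and $x[5] = A\sin(\pi/2 + \Theta) = A\cos\Theta$, the pair $(A, \Theta)$ follows from the two observations. The sketch below assumes $A > 0$ and $\Theta \in (-\pi, \pi]$ so that the solution is unique; the numerical values are arbitrary.

```matlab
% Recover A and Theta from two observations of X[n] = A*sin(pi*n/10 + Theta).
% Assumptions for uniqueness: A > 0 and Theta in (-pi, pi].
A = 1.7;  Theta = -0.6;                   % the "unknown" outcome (arbitrary values)
xfun = @(n, amp, ph) amp * sin(pi*n/10 + ph);

x0 = xfun(0, A, Theta);                   % equals A*sin(Theta)
x5 = xfun(5, A, Theta);                   % equals A*sin(pi/2 + Theta) = A*cos(Theta)

Ahat = hypot(x0, x5);                     % sqrt(x0^2 + x5^2)
That = atan2(x0, x5);                     % angle with sine x0/A and cosine x5/A

% With A and Theta known, every other sample is predicted exactly.
n = (-20:20)';
predErr = max(abs(xfun(n, Ahat, That) - xfun(n, A, Theta)));
```

The prediction error `predErr` is at rounding level, which is the sense in which this sequence is perfectly predictable from two samples.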

In the next example we see how a more general but still “deterministic” random 
sequence can be made out of a random vector. 


Example 8.1-3 
(random sequence with finite support) Let $X[n,\zeta]$ be given by

\[ X[n,\zeta] \triangleq \begin{cases} X_n(\zeta), & 1 \le n \le N, \\ 0, & \text{else.} \end{cases} \]

Since $X[n] = 0$ except for $n \in [1,N]$, we say $X[n]$ has finite support. Because of this
finite support property, we can model this random sequence by a random vector $\mathbf{X} =
(X_1, X_2, \dots, X_N)^T$ and then use the rich calculus of matrix algebra, for example, covariance
matrices and linear transformations, as presented in Chapter 5. Many random sequences
can be approximated this way, although note that we would have to consider the limiting
behavior of such $\mathbf{X}$, as $N \to \infty$, to model a general random sequence.
Figure 8.1-2 Tree diagram for discrete amplitude random sequence. (Tree levels $n = 0, 1, 2, 3$.)


Example 8.1-4 
(tree diagram for random sequence) Let the random sequence $X[n]$ be defined over $n \ge 0$,
and take on only $M$ discrete values, $0,1,2,\dots,M-1$. Further assume the starting value is
pinned at $X[0] = 0$. Then we can illustrate the evolution of the sample sequences of this
random sequence with a tree diagram, with branching factor $M$ at each node $n = 0,1,2,\dots$
as illustrated in Figure 8.1-2.

At each level $n$ of the tree, the node values give possible sample sequence values $x[n]$,
with branch index $i = 0,\dots,M-1$. The sample sequences are identified by the sequence of
node values of a path through the tree starting from the root node $n = 0$. If we identify the
path string $i_1 i_2 i_3 \dots$ with the base-$M$ number $0.i_1 i_2 i_3 \dots$, we can call this point the outcome
$\zeta \in [0,1] = \Omega$, the sample space.† Finally we can label the branches with the conditional
probability $P[X[n] = i_n \mid \{X[k] \text{ on same path for } k \le n-1\}]$, which in Figure 8.1-2 is
denoted as $P[i_n \mid i_{n-1} i_{n-2} \dots i_1 0]$. Then the probability of any node value at tree level $n$ is
just given by the product of all the probability branch labels back to the root node along
this path. Note that all sample sequences that agree up to time $n$ will correspond to a
neighborhood in the sample space $\Omega = [0,1]$ of radius $\frac{1}{2}M^{-n}$.
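
The product rule for node probabilities and the base-$M$ labeling of outcomes can be sketched as follows. The branch labels here are hypothetical: for simplicity the conditional probability of the next value is assumed to depend only on the current node value, which is a special case chosen for the sketch and not something the example requires.

```matlab
% Tree of Example 8.1-4 with M = 3 (illustrative choice).
% Hypothetical branch labels: P(r+1, c+1) = P[X[n] = c | X[n-1] = r].
M = 3;
P = [0.5 0.3 0.2;
     0.1 0.6 0.3;
     0.2 0.2 0.6];                        % each row sums to 1
pathIdx = [2 0 1];                        % i_1 i_2 i_3, starting from X[0] = 0

prev = 0;  pathProb = 1;
for ik = pathIdx
    pathProb = pathProb * P(prev + 1, ik + 1);   % multiply branch labels
    prev = ik;
end

% Outcome zeta = 0.i_1 i_2 i_3 in base M. All sample sequences sharing this
% prefix map into an interval of length M^(-3) that starts at zeta.
k = 1:numel(pathIdx);
zeta = sum(pathIdx .* M .^ (-k));
```

For this path the product is $0.2 \times 0.2 \times 0.3 = 0.012$ and the outcome is $\zeta = 2/3 + 0/9 + 1/27 = 19/27$.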


This example also has shown how to construct a consistent underlying sample space in 
the common case where we are given just the probability distribution information about the 
set of random variables that make up the random sequence. Note that when the random 
variables are all independent of one another, that is, jointly independent, and this probability 
distribution doesn’t change with time, the branch labels in the tree are all the same, and in 
effect, the tree collapses to one stage. This is the situation called a sequence of i.i.d. random 
variables in probability theory. Generalizing this slightly we have the following definition. 


Definition 8.1-2 An independent random sequence is one whose random variables at
any times $n_1, n_2, \dots, n_N$ are jointly independent for all positive integers $N$.


†For example let $M = 8$ and consider the base 8 number $0.1200\dots0\dots$. This implies that $X[1] =
1$, $X[2] = 2$, and all subsequent values are 0.
Figure 8.1-3 Example of a sample sequence of a random sequence, plotted against time $n$ from 0 to 1200. (Samples connected only for plot.)


Independent random sequences play a key role in our theory because they are relatively 
easy to analyze, they form the basis of more complicated and accurate models, and it is easy 
to get approximate sample sequences using random number generators on computers. Also 
when the discrete data arises by sampling continuous-time data, statistical independence 
often is a good approximation if the samples are far apart. 

Figure 8.1-3 shows a segment from a real noise sequence, and Figure 8.1-4 shows a close- 
up portion revealing its discrete-time nature and detailed “randomness.” This segment could 
have been taken from anywhere in the noise sequence and the statistical properties would 
have been the same. This remarkable property hints at some form of “stationarity” which 
will shortly be defined (Definition 8.1-5). Note that successive random variables, making 
up this segment, do not appear to be independent. Rather they are evidently correlated, 
necessitating in general an Nth-order probability distribution to statistically describe just 
this segment of this noise sequence. Continuing in this way, we would need an infinite-order 
CDF to characterize the whole random sequence! 

In order to deal with infinite length random sequences, we may have to be able to
compute the probabilities of infinite intersections of events,† for example, the event $\{X[n] <
5 \text{ for all positive } n\}$, which can be written as either $\bigcap_{n=1}^{\infty}\{X[n] < 5\}$ or, by De Morgan’s
laws, in terms of the infinite union $\left(\bigcup_{n=1}^{\infty}\{X[n] \ge 5\}\right)^c$. This requires that we can define
and work with the probabilities of infinite collections of events, which presents a problem
with Axiom 3 of probability measure: That is, for $AB = \phi$, the null set,

\[ P[A \cup B] = P[A] + P[B] \quad \text{(Axiom 3)}. \tag{8.1-1} \]


†Please review Section 1.4 on the definition of infinite intersections and unions. The concept is simple
but often misunderstood.
Figure 8.1-4 Close-up view of portion of sample sequence.

By iteration we could build this result up to the result

\[ P\left[\bigcup_{n=1}^{N} A_n\right] = \sum_{n=1}^{N} P[A_n], \]

for any finite positive $N$, assuming $A_iA_j = \phi$ for all $i \ne j$. This is called finite additivity.


It will permit us to evaluate $\lim_{N\to\infty} P\left[\bigcup_{n=1}^{N} A_n\right]$, but what we need above is $P\left[\bigcup_{n=1}^{\infty} A_n\right]$,
where $A_n \triangleq \{X[n] \ge 5\}$. For general functions these two quantities might not be the same,
that is, $\lim_{N\to\infty} f(x_N) \ne f\left(\lim_{N\to\infty} x_N\right)$. For this interchange of limiting operations to
be valid, we need some kind of continuity built into probability measure $P$. This can be
achieved by augmenting or replacing Axiom 3 by the stronger infinitely (countably) additive
Axiom 4 given as


Axiom 4 (Countable Additivity)

\[ P\left[\bigcup_{n=1}^{\infty} A_n\right] = \sum_{n=1}^{\infty} P[A_n] \tag{8.1-2} \]

for an infinite collection of events satisfying $A_iA_j = \phi$ for $i \ne j$.


Fortunately, in the branch of mathematics called measure theory [8-1] (see also 
Appendix D), it is shown that it is always possible to construct probability measures satis- 
fying the stronger Axiom 4. Moreover, if one has defined a probability measure P satisfying 
Axiom 3, that is, it is finitely additive, then the Russian mathematician Kolmogorov [8-2], 
often referred to as the father of modern probability, has shown that it is always possible
to extend the measure P to satisfy the countable additivity Axiom 4. We pause now for 
an example, after which we will show that Axiom 4 is equivalent to the desired continuity 
of the probability measure P. Henceforth, we will assume that our probability measures 
satisfy Axiom 4, and say they are countably additive. 


Infinite-length Bernoulli Trials 


Let $\Omega = \{\mathrm{H}, \mathrm{T}\}$, i.e., two outcomes $\zeta = \mathrm{H}$ and $\mathrm{T}$, with $P[\mathrm{H}] = p$, where $0 < p < 1$, and
$P[\mathrm{T}] = q \triangleq 1-p$. Define the random variable $W$ by $W(\mathrm{H}) \triangleq 1$ and $W(\mathrm{T}) \triangleq 0$, indicative
of successes and failures in coin flipping.

Let $\Omega_n$ be the sample space on the $n$th flip (the $n$th copy of $\Omega$) and define a new
event space as the infinite cross product† $\boldsymbol{\Omega}_\infty \triangleq \times_{n=1}^{\infty}\Omega_n$. This would be the sample space
associated with an infinite sequence of flips, each with sample space $\Omega$. We then define
the random sequence $W[n,\boldsymbol{\zeta}] \triangleq W(\zeta_n)$, thus generating the Bernoulli random sequence
$W[n]$, $n \ge 1$. Here the outcome $\boldsymbol{\zeta}$ is given as the outcomes at the individual trials as
$\boldsymbol{\zeta} = (\zeta_1, \zeta_2, \zeta_3, \dots)$.

Consider the probability measure for the infinite dimensional sample space $\boldsymbol{\Omega}_\infty$. Letting
$A_n$ denote an event‡ at trial $n$, that is, $A_n \in \mathcal{F}_n$, where $\mathcal{F}_n$ is the field of events in the
probability space $(\Omega_n, \mathcal{F}_n, P)$ of trial $n$, we need to have $\bigcap_{n=1}^{\infty} A_n$ as an event in $\boldsymbol{\mathcal{F}}_\infty$, the
$\sigma$-field of events in $\boldsymbol{\Omega}_\infty$. To complete this field of events, we will have to augment it with
all the countable intersections and unions of such events. For example, we may want to
calculate the probability of the event

\[ \{W[1] = 1, W[2] = 0\} \cup \{W[1] = 0, W[2] = 1\}, \]

which can be interpreted as the union of two events of the form $\bigcap_{n=1}^{\infty} A_n$; that is, $\{W[1] =
1, W[2] = 0\} = \bigcap_{n=1}^{\infty} A_n$ with $A_1 = \{W[1] = 1\}$, $A_2 = \{W[2] = 0\}$, and $A_n = \Omega_n$ for
$n \ge 3$. Hence $\boldsymbol{\mathcal{F}}_\infty$ must include all such events for completeness. To construct a probability
measure on $\boldsymbol{\Omega}_\infty$, we start with sets of the form $A_\infty = \bigcap_{n=1}^{\infty} A_n$, and define in the case of
independent trials,

\[ \boldsymbol{P}_\infty[A_\infty] \triangleq \prod_{n=1}^{\infty} P[A_n]. \]

We then extend this probability measure to all of $\boldsymbol{\mathcal{F}}_\infty$ by using Axiom 4 and the fact that
every member of $\boldsymbol{\mathcal{F}}_\infty$ is expressible as the countable union and intersection of events of the
form $\bigcap_{n=1}^{\infty} A_n$. We have in principle thus constructed the probability space $(\boldsymbol{\Omega}_\infty, \boldsymbol{\mathcal{F}}_\infty, \boldsymbol{P}_\infty)$
corresponding to the infinite-length Bernoulli trials, with associated Bernoulli random
sequence

\[ W[n,\boldsymbol{\zeta}] = W(\zeta_n), \quad n \ge 1. \]
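
The product definition can be tried on the event displayed above. Each of its two pieces constrains only the first two coordinates (all other factors equal 1), so the probability is $pq + qp$. The MATLAB sketch below compares this with a relative frequency; the value $p = 0.3$ and the number of trials are arbitrary choices made for the illustration.

```matlab
% Probability of {W[1]=1, W[2]=0} U {W[1]=0, W[2]=1} for independent trials.
% Only two coordinates are constrained (A_n = Omega_n for n >= 3), so the
% infinite product reduces to two factors. p = 0.3 is an arbitrary choice.
p = 0.3;  q = 1 - p;
exact = p*q + q*p;                         % the two events are disjoint

% Monte Carlo check using only the first two coordinates of each outcome.
trials = 1e5;
W   = double(rand(trials, 2) <= p);        % W(:,n) = 1 with probability p
est = mean(double(W(:,1) ~= W(:,2)));      % relative frequency of exactly one 1
```

With these settings `est` should land within about one percentage point of `exact`, which equals 0.42.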


†Here the infinite cross product $\times_{n=1}^{\infty}\Omega_n$ simply means that the points in $\boldsymbol{\Omega}_\infty$ consist of all the infinite-length
sequences of events, each one in $\Omega_n$ for some $n$. Thus if outcome $\boldsymbol{\zeta} \in \boldsymbol{\Omega}_\infty$, then $\boldsymbol{\zeta} = (\zeta_1, \zeta_2, \zeta_3, \dots)$,
where outcome $\zeta_n$ is in $\Omega_n$ for each $n \ge 1$. (The finite-length case of Bernoulli trials was treated in
Section 1.9.)

‡Most likely just a singleton event, that is, just one outcome, in this binary case.
We have just seen how to construct the sample space $\boldsymbol{\Omega}_\infty$ for the (infinite-length)
Bernoulli random sequence, where the outcomes $\boldsymbol{\zeta}$ are just infinite-length sequences of “H”
and “T.” This $W[n]$ is thus our first nontrivial example of a random sequence. However,
it may seem a bit artificial to regard each random variable $W[n,\boldsymbol{\zeta}]$ as a function of the
infinite dimensional outcome vectors that make up the elements in the sample space $\boldsymbol{\Omega}_\infty$.
It seems as though we have unnecessarily complicated the situation; after all, $W[n,\boldsymbol{\zeta}]$ is just
$W(\zeta_n)$. To see that this notational complication is unavoidable, let us turn to the commonly
occurring model for correlated noise,

\[ X[n] = \sum_{m=1}^{n} \alpha^{n-m} W[m], \quad \text{for } n \ge 1, \tag{8.1-3} \]

where $W[n]$ is the Bernoulli random sequence just created. Writing the filtered output $X[n]$
for each outcome $\boldsymbol{\zeta}$,

\[ X[n,\boldsymbol{\zeta}] = \sum_{m=1}^{n} \alpha^{n-m} W(\zeta_m), \]

we see that each $X[n,\boldsymbol{\zeta}]$ is a function of an ever-increasing (with $n$) number of components
of $\boldsymbol{\zeta}$, that is, the value of $X[n]$ depends on outcomes $\zeta_1, \zeta_2, \dots, \zeta_n$. If we just dealt with
each fixed value of $n$ as a separate problem, that is, a separate sample space and probability
measure, there would be the unanswered question of consistency. This is where, in practice,
we would call on Kolmogorov’s consistency theorem to show that our results are consistent
with one sample space $\boldsymbol{\Omega}_\infty$, which has (infinite-length) outcomes $\boldsymbol{\zeta}$.†


Example 8.1-5 
(correlated noise) Consider the random sequence in Equation 8.1-3, with $|\alpha| < 1$. We take
the Bernoulli random sequence $W[n]$ as input, that is, $W[n] = 1$ with probability $p$, and
$W[n] = 0$ with probability $q \triangleq 1-p$. We want to find the mean of $X[n]$ at each positive $n$.
Since the expectation operator is linear, we can write

\[ \begin{aligned}
E\{X[n]\} &= E\left\{\sum_{m=1}^{n} \alpha^{n-m} W[m]\right\} \\
&= \sum_{m=1}^{n} \alpha^{n-m} E\{W[m]\} \\
&= p\sum_{m=0}^{n-1} \alpha^{m} = p\,\frac{1-\alpha^{n}}{1-\alpha}.
\end{aligned} \]


†The use of bold notation for $\boldsymbol{\Omega}_\infty$, $\boldsymbol{\zeta}$, $\boldsymbol{P}_\infty$ is rather extravagant but was introduced to avoid confusion.
Clearly, $\boldsymbol{\Omega}_\infty$ is not the same as $\lim_{n\to\infty}\Omega_n$. Each outcome in $\Omega_n$ is either a {H} or a {T} no matter how large
$n$ gets. On the other hand, the outcomes $\boldsymbol{\zeta} \in \boldsymbol{\Omega}_\infty$ are infinitely long strings of H’s and T’s. In the future
we shall dispense with the bold notation even if $\Omega$ is generated by an infinite cross product and its elements
(outcomes) are infinitely long strings.
The random sequence $X[n]$ thus created is not a sequence of independent random
variables, as we can see by calculating the correlation

\[ \begin{aligned}
E\{X[2]X[1]\} &= E\{(\alpha W[1] + W[2])\,W[1]\} \\
&= \alpha E\{W^2[1]\} + E\{W[2]\}E\{W[1]\} \\
&= \alpha p + p^2 \\
&\ne (\alpha + 1)p^2 = E\{X[2]\}E\{X[1]\} \quad (\alpha \ne 0).
\end{aligned} \]

The random variables $X[2]$ and $X[1]$ must be dependent, since they are not even
uncorrelated.
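
A quick Monte Carlo check of this correlation is sketched below. The parameter values and the number of trials are arbitrary choices for the illustration; only $X[1] = W[1]$ and $X[2] = \alpha W[1] + W[2]$ are taken from Equation 8.1-3.

```matlab
% Check E{X[2]X[1]} = alpha*p + p^2 against E{X[2]}E{X[1]} = (alpha+1)*p^2.
alpha = 0.95;  p = 0.5;  trials = 2e5;     % arbitrary illustrative values
W  = double(rand(trials, 2) <= p);         % columns hold W[1] and W[2]
X1 = W(:,1);                               % X[1] = W[1]
X2 = alpha*W(:,1) + W(:,2);                % X[2] = alpha*W[1] + W[2]

corrEst  = mean(X2 .* X1);                 % sample estimate of E{X[2]X[1]}
corrTrue = alpha*p + p^2;                  % 0.725 for these values
prodMean = (alpha + 1) * p^2;              % 0.4875 for these values
```

The estimate should sit near `corrTrue` and well away from `prodMean`, consistent with $X[1]$ and $X[2]$ being correlated.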

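However, since the $W[n]$ are uncorrelated we can easily calculate the variance
$\mathrm{Var}\{X[n]\}$ as

\[ \begin{aligned}
\mathrm{Var}\{X[n]\} &= \sum_{m=1}^{n} \mathrm{Var}\{\alpha^{n-m} W[m]\} \\
&= \sum_{m=1}^{n} \alpha^{2(n-m)}\, \mathrm{Var}\{W[m]\} \\
&= \frac{1-\alpha^{2n}}{1-\alpha^{2}}\; pq.
\end{aligned} \]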

The dynamics of this random sequence can be modeled using a difference equation. Since
$X[n-1] = \sum_{m=1}^{n-1} \alpha^{n-1-m} W[m]$, it follows that $X[n] = \alpha X[n-1] + W[n]$, a result that
clearly exhibits the dependence of $X[n]$ on its immediate neighbor $X[n-1]$.† Thus, correlated
noise $X[n]$ can be generated from the independent sequence $W[n]$ by filtering with the
configuration shown in Figure 8.1-5. From Equation 8.1-3 we see that for large $n$, $X[n]$ is
the sum of a large number of independent random variables. When $\alpha$ is close to 1, many of
these terms carry comparable weight, and by the Central Limit Theorem the distribution of
$X[n]$ is then approximately Gaussian as $n \to \infty$, with mean approaching $p/(1-\alpha)$ and
variance approaching $pq/(1-\alpha^2)$.
Zero-mean, correlated, Gaussian noise can be generated using the same model. Thus,
with $W[1], W[2], \dots, W[n], \dots$ denoting zero-mean, independent, identically distributed
Gaussian random variables with distribution $N(0, \sigma_W^2)$, the random sequence $X[n] = \sum_{m=1}^{n} \alpha^{n-m} W[m]$
will be zero-mean, Gaussian with variance

\[ \mathrm{Var}\{X[n]\} = \frac{1-\alpha^{2n}}{1-\alpha^{2}}\;\sigma_W^2, \]

where $\sigma_W^2 = \mathrm{Var}\{W[n]\}$. Here too, the sequence produced by the filter is correlated since
$E\{X[2]X[1]\} = \alpha E\{W^2[1]\} = \alpha\sigma_W^2 \ne E\{X[2]\}E\{X[1]\} = 0$.

Figure 8.1-5 A feedback filter that generates correlated noise $X[n]$ from an uncorrelated sequence
$W[n]$. (The input $W[n]$ enters a summing node whose output is $X[n]$; the output is fed back to the summing node through a delay and a gain $\alpha$.)

†Such explicit dependence in the equation like this is sometimes called direct dependence.


The next example gives a MATLAB method to construct realizations of the Bernoulli 
random sequence and then passes the resulting sample sequences through a first-order filter 
to generate sample sequences of a (more realistic) correlated random sequence. 


Example 8.1-6 
(sample sequence construction) We use MATLAB to construct a sample sequence of $W[n]$.
The MATLAB program

u = rand(40,1);    % 40 independent samples, uniform on (0,1)
w = 0.5 >= u;      % w[n] = 1 if u[n] <= 0.5, otherwise 0
stem(w)            % plot the sample sequence

uses the built-in function “rand” to generate a 40-element vector of uniform random variables.
The second line sets the vector elements $w[n]$ to 1 if $u[n] \le 0.5$, and to 0 if $u[n] > 0.5$.
So $w[n]$ is a sample sequence of the Bernoulli random sequence with $p = 0.5$. The corresponding
MATLAB plot is shown in Figure 8.1-6.


Figure 8.1-6 A sample sequence $w[n]$ for the Bernoulli random sequence $W[n]$. (Horizontal axis: time $n$, 0 to 40.)

Figure 8.1-7 First 40 points illustrating startup transient.


To model the sample sequences of $X[n]$, which we denote $x[n]$, we can filter the sequence
$w[n]$ with the filter,

\[ x[n] = \alpha x[n-1] + w[n], \]

which has impulse response $h[n] = \alpha^n u[n]$ (here $u[n]$ denotes the unit step, not the vector u above) to realize the linear operation of Equation 8.1-3.
The corresponding MATLAB m-file fragment is

alpha = 0.95;              % feedback gain used for the figures
b = 1.0;                   % numerator (feedforward) coefficient
a = [1.0 -alpha];          % denominator: x[n] - alpha*x[n-1] = w[n]
x = filter(b,a,double(w)); % convert the logical vector w to double first
stem(x)

The result for $\alpha = 0.95$ and a 400-element vector was computed. Figure 8.1-7 shows
the startup transient for the first 40 values. Figure 8.1-8 shows a sample of the approximate
steady-state behavior starting at $n = 350$ and plotted for 50 points. Note the sample average
value that has built up in $x[n]$ over time.


Note that the random sequence X[n] has typical noise-like characteristics. The filter has 
correlated the random variables making up $X[n]$ so that sample sequences $x[n]$ look more
“continuous.” This simple example is called an autoregressive (AR) model and is widely 
used in signal processing to model both noises and signals. Note that the deterministic 
defect of the initial examples has now been removed. The reason is that the Bernoulli input 
sequence provides a new independent value for every sample, ensuring that the next sample 
cannot be perfectly predicted from the past. 
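
To see how the theoretical mean and variance of Example 8.1-5 show up in such simulations, one can average over many independent sample sequences rather than over time. The following MATLAB sketch does this; the ensemble size `R` is an assumption of the sketch, and `filter` is applied column by column to a matrix whose columns are independent Bernoulli sample sequences.

```matlab
% Ensemble averages for the AR model x[n] = alpha*x[n-1] + w[n].
alpha = 0.95;  p = 0.5;  N = 400;  R = 2000;   % R is an illustrative ensemble size
W = double(rand(N, R) <= p);               % R independent Bernoulli sample sequences
X = filter(1, [1 -alpha], W);              % filters each column separately

nn = (1:N)';
ensMean    = mean(X, 2);                   % estimate of E{X[n]} for each n
meanTheory = p * (1 - alpha.^nn) / (1 - alpha);
ensVar     = var(X, 0, 2);                 % estimate of Var{X[n]} for each n
varTheory  = p*(1-p) * (1 - alpha.^(2*nn)) / (1 - alpha^2);
```

For these values the mean settles near $p/(1-\alpha) = 10$ and the variance near $pq/(1-\alpha^2) \approx 2.56$, which matches the average level that builds up in Figure 8.1-8.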
Figure 8.1-8 A segment of 50 points starting at n = 350.


Continuity of Probability Measure 


When dealing with an infinite number of events, we have seen that continuity of the proba- 
bility measure can be quite useful. Fortunately, the desired continuity is a direct consequence 
of the extended Axiom 4 on countable additivity (cf. Equation 8.1-2). 


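Theorem 8.1-1 Consider an increasing sequence of events $B_n$, that is, $B_n \subset B_{n+1}$
for all $n \ge 1$ as shown in Figure 8.1-9. Define $B_\infty \triangleq \bigcup_{n=1}^{\infty} B_n$; then $\lim_{n\to\infty} P[B_n] =
P[B_\infty]$.

Proof Define the sequence of events $A_n$ as follows:

\[ A_1 \triangleq B_1, \qquad A_n \triangleq B_n B_{n-1}^c, \quad n > 1. \]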


Figure 8.1-9 Illustrating an increasing sequence of events. 
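The $A_n$ are disjoint and $\bigcup_{n=1}^{N} A_n = \bigcup_{n=1}^{N} B_n$ for all $N$. Also $B_N = \bigcup_{n=1}^{N} B_n$, because the
$B_n$ are increasing. So

\[ P[B_N] = P\left[\bigcup_{n=1}^{N} B_n\right] = P\left[\bigcup_{n=1}^{N} A_n\right] = \sum_{n=1}^{N} P[A_n], \]

and

\[ \begin{aligned}
\lim_{N\to\infty} P[B_N] &= \lim_{N\to\infty} \sum_{n=1}^{N} P[A_n] \\
&= \sum_{n=1}^{+\infty} P[A_n] && \text{by definition of the limit of a sum} \\
&= P\left[\bigcup_{n=1}^{\infty} A_n\right] && \text{by Axiom 4} \\
&= P[B_\infty].
\end{aligned} \]

This last step results from $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n \triangleq B_\infty$.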


Corollary 8.1-1 Let $B_n$ be a decreasing sequence of events, that is, $B_n \supset B_{n+1}$ for
all $n \ge 1$. Then

\[ \lim_{n\to\infty} P[B_n] = P[B_\infty], \]

where

\[ B_\infty \triangleq \bigcap_{n=1}^{\infty} B_n. \]

Proof Similar to proof of Theorem 8.1-1 and left to the student.
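
One possible route to the proof, offered here as a hint rather than as the authors' intended argument, is to pass to complements: the events $B_n^c$ form an increasing sequence, so Theorem 8.1-1 applies to them, and De Morgan's laws connect their union to $B_\infty$.

```latex
\begin{align*}
B_n \supset B_{n+1} \;&\Longrightarrow\; B_n^c \subset B_{n+1}^c,
  \qquad \bigcup_{n=1}^{\infty} B_n^c
  = \Bigl(\,\bigcap_{n=1}^{\infty} B_n\Bigr)^{c} = B_\infty^{c}, \\
\lim_{n\to\infty} P[B_n]
  &= 1 - \lim_{n\to\infty} P[B_n^c]
   = 1 - P[B_\infty^{c}]
   = P[B_\infty].
\end{align*}
```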


Example 8.1-7 
Let $B_n \triangleq \{X[k] < 2 \text{ for } 0 \le k \le n\}$, for $n = 0,1,2,\dots$. In words, $B_n$ is the event that
$X[k]$ is less than 2 for the indicated range of $k$. Clearly $B_{n+1}$ is a subset of $B_n$, that is,
$B_{n+1} \subset B_n$ for all $n = 0,1,2,\dots$. Also if we set $B_\infty \triangleq \{X[k] < 2 \text{ for all } k \ge 0\}$, then
$B_\infty = \bigcap_{n=0}^{\infty} B_n$. So we can write, by the above corollary,

\[ \begin{aligned}
P[B_\infty] &= \lim_{n\to\infty} P[B_n] \\
&= \lim_{n\to\infty} P[X[0] < 2, \dots, X[n] < 2].
\end{aligned} \]

Thus, the corollary provides a way of calculating the probabilities of events involving an infinite number of
random variables by just taking the limit of the probability involving a finite number of
random variables. This type of limiting calculation is often performed in engineering analyses,
and typically without explicit justification (i.e., without worrying about the consistency
problem mentioned earlier). In this section we have seen that the correctness of the approach
rests on a fundamental axiom of probability theory, Axiom 4 (countable additivity).
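
As a small numerical illustration of this limiting calculation, suppose (purely as an assumption for this sketch) that the $X[k]$ are i.i.d. $N(0,1)$. Then $P[B_n]$ is a product of $n+1$ equal factors $P[X[k] < 2]$, and the limit $P[B_\infty]$ is 0.

```matlab
% P[B_n] = P[X[0] < 2, ..., X[n] < 2] for an assumed i.i.d. N(0,1) sequence.
Phi2 = 0.5 * (1 + erf(2 / sqrt(2)));       % P[X[k] < 2], about 0.9772
n    = (0:500)';
PBn  = Phi2 .^ (n + 1);                    % independence gives a product of n+1 factors

% PBn decreases monotonically toward P[B_inf] = 0.
isDecreasing = all(diff(PBn) < 0);
```

Even though each individual factor is close to 1, the probability that the whole sequence stays below 2 vanishes in the limit.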


We next use the continuity of the probability measure P to prove an elementary fact 
about CDFs. 


Example 8.1-8 
(continuity on the right) The CDF is continuous from the right; that is, for $F_X(x) \triangleq
P[X(\zeta) \le x]$ [cf. Property (iii) of $F_X$ in Section 2.3], we have

\[ \lim_{n\to\infty} F_X\left(x + \frac{1}{n}\right) = F_X(x). \]

To show this, we define

\[ B_n \triangleq \left\{\zeta : X(\zeta) \le x + \frac{1}{n}\right\} \]

and note that $B_n$ is a decreasing sequence of events, where $B_\infty = \{\zeta : X(\zeta) \le x\}$
and

\[ F_X\left(x + \frac{1}{n}\right) = P[B_n]. \]

By application of Corollary 8.1-1, we get

\[ \lim_{n\to\infty} F_X\left(x + \frac{1}{n}\right) = \lim_{n\to\infty} P[B_n] = P[B_\infty] = F_X(x). \]
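
Right continuity is easiest to see at a jump. The sketch below evaluates the CDF of the Bernoulli random variable $W$ (values 0 and 1 with probabilities $q$ and $p$) just to the right and just to the left of $x = 0$; the value $p = 0.3$ is an arbitrary choice for the illustration.

```matlab
% CDF of a Bernoulli random variable W: jump of q at x = 0 and of p at x = 1.
p = 0.3;  q = 1 - p;                       % p is an arbitrary illustrative value
F = @(x) q*(x >= 0) + p*(x >= 1);

n = 10.^(1:6)';
fromRight = F(0 + 1./n);                   % equals F(0) = q for every n
fromLeft  = F(0 - 1./n);                   % equals 0 for every n
```

The values approaching from the right agree with $F(0) = q$, while those from the left stay at 0, so the CDF is continuous from the right but not from the left at the jump.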


Statistical Specification of a Random Sequence 


A random sequence $X[n]$ is said to be statistically specified by knowing its $N$th-order CDFs
for all integers $N \ge 1$, and for all times, $n, n+1, \dots, n+N-1$, that is, if we know

\[ \begin{aligned}
&F_X(x_n, x_{n+1}, \dots, x_{n+N-1};\, n, n+1, \dots, n+N-1) \\
&\qquad \triangleq P[X[n] \le x_n, X[n+1] \le x_{n+1}, \dots, X[n+N-1] \le x_{n+N-1}],
\end{aligned} \tag{8.1-4} \]

where the variables after the semicolon, $n, n+1, \dots, n+N-1$, indicate the location of the
$N$ random variables in this joint CDF. Note that this is an infinite set of CDFs for each
order $N$, because we must know the joint CDF at all times $n$, $-\infty < n < +\infty$. Incurring
some penalty in notational clarity, we often write the joint CDFs more simply as

\[ F_X(x_n, x_{n+1}, \dots, x_{n+N-1}), \quad \text{for all } n, \text{ and for all } N \ge 1. \tag{8.1-5} \]

We also define $N$th-order CDFs for nonconsecutive time parameters,

\[ F_X(x_{n_1}, x_{n_2}, \dots, x_{n_N};\, n_1, n_2, \dots, n_N). \]


It may seem that this statistical specification is some distance from a complete descrip- 
tion of the entire random sequence since no one distribution function in this infinite set 
of finite-order CDFs describes the entire random sequence. Nevertheless, if we specify all 
these finite-order joint distributions at all finite times, using continuity of the probability 
measure that we have just shown, we can calculate the probabilities of events involving infi- 
nite numbers of random variables via limiting operations involving the finite-order CDFs. 
Of course, we do have to make sure that our set of Nth-order CDFs is consistent within 
itself! Sometimes it is trivial, for instance, the case where all the random variables that 
make up the random sequence are independent of one another, for example, a Bernoulli 
random sequence. 


Example 8.1-9 
(consistency) For consistency, the low-order CDFs must agree with the higher-order CDFs.
For example, considering just $N = 2$ and 3, we must have

\[ F_X(x_n, x_{n+2};\, n, n+2) = F_X(x_n, \infty, x_{n+2};\, n, n+1, n+2), \]

for all $n$, and for all values of $x_n$ and $x_{n+2}$. Likewise, the $N = 1$ CDFs must be consistent
with those of $N = 2$. Further the consistency must extend to all higher orders $N$.
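
For discrete-valued sequences the same requirement can be stated with probability mass functions: letting $x_{n+1} \to \infty$ in the CDF corresponds to summing the third-order pmf over the middle variable. The sketch below checks this for a hypothetical binary-valued model in which each value depends on the previous one through a transition table `T`; the model and its numbers are assumptions made only to have something concrete to check.

```matlab
% Consistency check: marginalizing a 3rd-order pmf over the middle variable
% must reproduce the 2nd-order pmf at times (n, n+2).
pi0 = [0.4; 0.6];                          % hypothetical pmf of X[n] on {0,1}
T   = [0.9 0.1;
       0.2 0.8];                           % hypothetical T(i,j) = P[X[m+1]=j-1 | X[m]=i-1]

P3 = zeros(2, 2, 2);                       % joint pmf of (X[n], X[n+1], X[n+2])
for i = 1:2
    for j = 1:2
        for k = 1:2
            P3(i,j,k) = pi0(i) * T(i,j) * T(j,k);
        end
    end
end

P13_from3  = squeeze(sum(P3, 2));          % sum over the middle variable x_{n+1}
P13_direct = diag(pi0) * (T*T);            % pmf of (X[n], X[n+2]) computed directly
gap = max(abs(P13_from3(:) - P13_direct(:)));
```

Here the two tables agree because both were built from the same underlying model, which is the sense in which consistency is guaranteed by construction.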


Consistency can be guaranteed by construction, as in the case of the filtered Bernoulli 
random sequence of Example 8.1-6 above. If we were faced with a suspect set of Nth-order 
CDFs of unknown origin, it would be a daunting task, indeed, to show that they were 
consistent. Hence, we see the important role played by constructive models in stochastic 
sequences and processes. 

In summary, we have seen two ways to specify a random sequence: the statistical char- 
acterization (Equation 8.1-4) and the direct specification in terms of the random functions 
$X[n,\zeta]$. We use the word statistical to indicate that the former information can be obtained,
at least conceptually, by estimating the Nth-order CDFs for N = 1,2,3,... and so forth, 
that is, by using statistics. 

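The $N$th-order probability density functions (pdf’s) are given for differentiable $F_X$ as

\[ \begin{aligned}
&f_X(x_n, x_{n+1}, \dots, x_{n+N-1};\, n, n+1, \dots, n+N-1) \\
&\qquad = \frac{\partial^N F_X(x_n, x_{n+1}, \dots, x_{n+N-1};\, n, n+1, \dots, n+N-1)}{\partial x_n\, \partial x_{n+1} \cdots \partial x_{n+N-1}}
\end{aligned} \tag{8.1-6} \]

for every integer (time) $n$ and positive integer (order) $N$. Sometimes we will omit the
subscript $X$ when only one random sequence is under consideration. Also, we may drop the
explicit time notation and write

\[ f_X(x_n, x_{n+1}, \dots, x_{n+N-1}) \quad \text{for} \quad f_X(x_n, x_{n+1}, \dots, x_{n+N-1};\, n, n+1, \dots, n+N-1). \]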


We will sometimes want to deal with complex random variables and sequences. By this
we mean an ordered pair of real random variables, that is, $X = (X_R, X_I)$ often written as
$X = X_R + jX_I$ with CDF

\[ F_X(x_R, x_I) \triangleq P[X_R \le x_R, X_I \le x_I]. \]

The corresponding pdf is then

\[ f_X(x_R, x_I) = \frac{\partial^2 F_X(x_R, x_I)}{\partial x_R\, \partial x_I}. \]

To simplify notation we will write $f_X(x)$ for $f_X(x_R, x_I)$ in what follows, with the understanding
that the respective integrals (sums for discrete valued complex case) are really
double integrals on the $(x_R, x_I)$ plane if the random variable is complex.†

The moments of a random sequence play an important role in most applications. In 
part this is because for a large class of random sequences (so-called ergodic sequences, to be 
covered in Section 10.4 in Chapter 10), they can be easy to estimate from just one sample 
sequence. The first moment or mean function of a random sequence is 


$$\mu_X[n] \triangleq E\{X[n]\} = \int_{-\infty}^{+\infty} x f_X(x; n)\,dx$$


for a continuous-valued random sequence X[n]. The mean function for a discrete-valued 
random sequence, taking on values from the set $\{x_k, -\infty < k < +\infty\}$ at time n, is evaluated as

$$\mu_X[n] = E\{X[n]\} = \sum_{k=-\infty}^{+\infty} x_k P[X[n] = x_k].$$ (8.1-7)


In the case of a mixed random sequence, as in the case of mixed random variables, it is 
convenient to write 


$$\mu_X[n] = \int_{-\infty}^{+\infty} x f_X(x; n)\,dx + \sum_{k=-\infty}^{+\infty} x_k P[X[n] = x_k].$$ (8.1-8)


Actually, using the concept of the Stieltjes integral [8-3], both terms can be rewritten in the
one form

$$\mu_X[n] = \int_{-\infty}^{+\infty} x \, dF_X(x; n),$$

in terms of the CDF $F_X(x; n)$.
The expected value of the product of the random sequence evaluated at two times,
$X[k]X^*[l]$, is called the autocorrelation function and is a two-parameter function of both
times k and l, where $-\infty < k, l < +\infty$,


†Complex random sequences are used as equivalent baseband models of certain bandpass signals and
noises. The resulting complex valued simulation can be then run at a much lower sample rate. 
$$R_{XX}[k, l] \triangleq E\{X[k]X^*[l]\} = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x_k x_l^* f_X(x_k, x_l; k, l)\,dx_k\,dx_l,$$ (8.1-9)


when the autocorrelation function exists (the usual case, but of course, in some cases the 
integral might not converge). Most of the time we will deal with second-order random 
sequences, defined by their property of having finite average power $E\{|X[n]|^2\} < \infty$. Then
the corresponding correlation function will always exist. Later we shall see that the conjugate
on the second factor in the autocorrelation function definition results in some notational
simplicities for complex-valued random sequences. We will also define the centered
random sequence $X_c[n] \triangleq X[n] - \mu_X[n]$, which is zero-mean, and consider its autocorrelation
function, called the autocovariance function of the original sequence X[n]. It is defined as


$$K_{XX}[k, l] \triangleq E\{(X[k] - \mu_X[k])(X[l] - \mu_X[l])^*\}.$$ (8.1-10)

Directly from these definitions, we note the following symmetry conditions must hold:

$$R_{XX}[k, l] = R_{XX}^*[l, k],$$ (8.1-11)

$$K_{XX}[k, l] = K_{XX}^*[l, k],$$ (8.1-12)

called Hermitian symmetry. Also note that

$$K_{XX}[k, l] = R_{XX}[k, l] - \mu_X[k]\mu_X^*[l].$$ (8.1-13)


The variance function is defined as $\sigma_X^2[n] \triangleq K_{XX}[n, n]$ and denotes the average power
in $X_c[n]$. The power of X[n] itself has been given above and equals $R_{XX}[n, n]$.


Example 8.1-10 
(Example 8.1-1 cont’d.) The mean function of X[n] as given in Example 8.1-1 is 


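$$\mu_X[n] = E\{X[n]\} = E\{X f[n]\} = \mu_X f[n],$$

where $\mu_X$ is the mean of the random variable X. The autocorrelation function is

$$R_{XX}[k, l] = E\{X[k]X^*[l]\} = E\{X f[k] X^* f^*[l]\} = E\{|X|^2\} f[k] f^*[l],$$

and so the autocovariance function is given as

$$K_{XX}[k, l] = E\{|X|^2\} f[k] f^*[l] - |\mu_X|^2 f[k] f^*[l]$$
$$= (E\{|X|^2\} - |\mu_X|^2) f[k] f^*[l]$$
$$= E\{|X - \mu_X|^2\} f[k] f^*[l]$$
$$= \sigma_X^2 f[k] f^*[l],$$

where $\sigma_X^2 = \mathrm{Var}(X)$. We thus see that the variance $\sigma_X^2[n]$ is just $\sigma_X^2 |f[n]|^2$.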
Figure 8.1-10 The T[n] are arrival times and the τ[n] are interarrival times. (Plot of T[n] versus n for n = 0 to 5.)


We look at a sequence which fits our notions of randomness better in the next example. 


Example 8.1-11 
(waiting times) Consider the random sequence consisting of i.i.d. random variables τ[n] for
n ≥ 1, each with the exponential pdf of Equation 2.4-16, that is,†

$$f_\tau(t; n) = f_\tau(t) = \lambda \exp(-\lambda t)\,u(t), \qquad n = 1, 2, \ldots$$

Write the running sum of the τ[k] up to time n, defined as

$$T[n] \triangleq \sum_{k=1}^{n} \tau[k],$$ (8.1-14)

and consider T[n] as a second random sequence for n = 1, 2, .... It turns out that the arrival
of random events in time is often modeled in this way. We say that T[n] is the time to the
nth arrival or waiting time and we call the τ[n] the interarrival times.‡ See Figure 8.1-10.

Later, in Chapter 9, we shall see that the important Poisson random process can be 
constructed in this way. Here we want to determine the pdf of T[n] at each n based on the 
definition in Equation 8.1-14. Using the fact that the τ[k] are independent, we can apply
Equation 4.7-3 and conclude that the pdf of T[n] will be the (n−1)-fold convolution product
of exponential pdf’s. Using convolution to determine the pdf of T[2], we get

$$f_T(t; 2) = f_\tau(t) * f_\tau(t) = \lambda^2 t \exp(-\lambda t)\,u(t).$$


†Recall that λ = 1/μ.
‡Please regard τ as a “capital tau” to continue our distinction between a random variable and the value
it takes on, that is, X = x.
Convolving this result with the exponential pdf a second time, we get

$$f_T(t; 3) = \tfrac{1}{2} \lambda^3 t^2 \exp(-\lambda t)\,u(t).$$

It turns out that the general form is the Erlang pdf,

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!} \lambda \exp(-\lambda t)\,u(t).$$ (8.1-15)

Figure 8.1-11 A plot of the Erlang pdf for λ = 1 and n = 3. (Horizontal axis: t.)


The Erlang or gamma pdf [8-4] is widely used in waiting-time problems in telecommunications
networks and is plotted via MATLAB in Figure 8.1-11 for n = 3 and λ = 1.0, which is
the pdf of the waiting time to the third arrival.


We can establish this density’s correctness by the Principle of Mathematical Induction. 
(See Section A.4 in Appendix A.) It is composed of two steps: (1) First show the formula is 
correct at n = 1; (2) then show that if the formula is true at n − 1, it must also be true at n.
Combining these two steps, we have effectively proved the result for all positive integers n. 

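We see that $f_T(t; 1)$ in Equation 8.1-15 is correct, so we proceed by assuming
Equation 8.1-15 is true at n − 1. By convolving with the exponential, we can show that
it is true at n as follows (here τ is a dummy variable of integration):

$$f_T(t; n) = f_T(t; n-1) * \lambda \exp(-\lambda t)\,u(t)$$
$$= \left[ \int_0^t \lambda \exp(-\lambda\tau) \frac{(\lambda\tau)^{n-2}}{(n-2)!} \, \lambda \exp(-\lambda(t-\tau)) \, d\tau \right] u(t)$$
$$= \lambda^n \exp(-\lambda t) \left[ \int_0^t \frac{\tau^{n-2}}{(n-2)!} \, d\tau \right] u(t)$$
$$= \frac{(\lambda t)^{n-1}}{(n-1)!} \lambda \exp(-\lambda t)\,u(t).$$

Using the i.i.d. property of the τ[n], we can also compute the mean as

$$\mu_T[n] = n\mu_\tau = n(1/\lambda) = n/\lambda$$

and the variance of the sum T[n] by repeated use of property (A) of Equation 4.3-18:

$$\mathrm{Var}[T[n]] = n \mathrm{Var}[\tau] = n/\lambda^2.$$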
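The Erlang results above can be checked numerically. The following Python sketch is an illustration only: the function names, the integration limit `t_max`, and the step count are our own assumptions, not part of the text. It integrates the pdf of Equation 8.1-15 by the midpoint rule and compares the area, mean, and variance with 1, n/λ, and n/λ².

```python
import math

def erlang_pdf(t, n, lam):
    """Erlang pdf of Equation 8.1-15; zero for t < 0 because of the unit step u(t)."""
    if t < 0:
        return 0.0
    return (lam * t) ** (n - 1) / math.factorial(n - 1) * lam * math.exp(-lam * t)

def erlang_moments(n, lam, t_max=60.0, steps=60000):
    """Midpoint-rule estimates of the area, mean, and variance of the Erlang pdf.

    t_max and steps are illustrative choices; t_max must be large enough
    that the neglected tail beyond it is negligible.
    """
    dt = t_max / steps
    area = mean = second = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        f = erlang_pdf(t, n, lam)
        area += f * dt
        mean += t * f * dt
        second += t * t * f * dt
    return area, mean, second - mean ** 2

area, mean, var = erlang_moments(n=3, lam=1.0)
print(area, mean, var)  # should be close to 1, n/lam, and n/lam**2
```

For n = 1 the same function reduces to the exponential pdf of the interarrival times, which gives a quick sanity check on the base case of the induction.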


We next introduce the most widely used random model in electrical engineering, commu- 
nications, and control: the Gaussian (Normal) random sequence. Its wide popularity stems 
from two important facts: (1) the Central Limit theorem (Theorem 4.7-2) assures that many 
processes occurring in practice are approximately Gaussian; and (2) the mathematics is espe- 
cially tractable in problems involving detection, estimation, filtering, and control theory. 


Definition 8.1-3 A random sequence X[n] is called a Gaussian random sequence if 
its Nth-order CDFs (pdf’s) are jointly Gaussian, for all N ≥ 1.


We note that the mean and covariance function will specify a Gaussian random sequence 
in the same way that the mean vector and covariance matrix determine a Gaussian random 
vector (see Section 5.5). This is because each Nth-order distribution function is just the 
CDF of a Gaussian random vector whose mean vector and covariance matrix are expressible 
in terms of the mean and covariance functions of the Gaussian random sequence. 


Example 8.1-12 
(pairwise average) Let W[n] be a real-valued Gaussian i.i.d. sequence with mean $\mu_W[n] = 0$
for all n and autocorrelation function $R_{WW}[k, l] = \sigma^2 \delta[k - l]$, σ > 0, where δ is the
discrete-time impulse

$$\delta[n] \triangleq \begin{cases} 1, & n = 0, \\ 0, & n \neq 0. \end{cases}$$

If we form a covariance matrix, then, for a vector of any N distinct samples, it will be
diagonal. So, by Gaussianity, each Nth-order pdf will factor into a product of N first-order
pdf’s. Hence the elements of this random sequence are jointly independent, or what we
call an independent (Gaussian) random sequence (cf. Definition 8.1-2). Next we create the
random sequence X[n] by taking the sum of the current and previous W[n] values,

$$X[n] \triangleq W[n] + W[n-1], \qquad \text{for } -\infty < n < +\infty.$$

Here X[n] is also Gaussian in all its Nth-order distributions (since a linear transformation 
of a Gaussian random vector produces a Gaussian vector by Theorem 5.6-1); hence X[n] is 
also a Gaussian random sequence. We can easily evaluate the mean of X[n] as 
$$\mu_X[n] = E\{X[n]\} = E\{W[n]\} + E\{W[n-1]\} = 0,$$

and its correlation function as

$$R_{XX}[k, l] = E\{X[k]X[l]\}$$
$$= E\{(W[k] + W[k-1])(W[l] + W[l-1])\}$$
$$= E\{W[k]W[l]\} + E\{W[k]W[l-1]\} + E\{W[k-1]W[l]\} + E\{W[k-1]W[l-1]\}$$
$$= R_{WW}[k, l] + R_{WW}[k, l-1] + R_{WW}[k-1, l] + R_{WW}[k-1, l-1]$$
$$= \sigma^2 (\delta[k-l] + \delta[k-l+1] + \delta[k-l-1] + \delta[k-l]).$$

Figure 8.1-12 Diagram of the tri-diagonal correlation function of Example 8.1-12.

We can plot this autocorrelation in the (k,/) plane as shown in Figure 8.1-12 and see 
the time extent of the dependence of the random sequence X[n]. 

From this figure, we see that the autocorrelation has value $2\sigma^2$ on the diagonal line l = k
and has value $\sigma^2$ on the diagonal lines l = k ± 1. It should be clear from Figure 8.1-12 that
X[n] is not an independent random sequence. However, the banded support of this covariance
function signifies that dependence is limited to shifts (k − l) = ±1 in time. Beyond this lag
we have uncorrelated, and hence in this Gaussian case, independent random variables.
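The tri-diagonal structure can also be seen by treating a finite block of X[n] as a linear transformation of a white vector. The Python sketch below is a minimal illustration under our own assumptions (the function name and the finite block length are hypothetical): it forms the matrix A with X = AW and computes the correlation matrix σ²AAᵀ.

```python
def pairwise_sum_correlation(num_samples, sigma2):
    """Correlation matrix of X[n] = W[n] + W[n-1] for n = 1, ..., num_samples.

    W[0], ..., W[num_samples] is white with variance sigma2, so R_WW = sigma2 * I
    and R_XX = sigma2 * A * A^T, where row i of A has ones in columns i and i + 1.
    """
    rows = num_samples
    cols = num_samples + 1
    A = [[1.0 if j in (i, i + 1) else 0.0 for j in range(cols)] for i in range(rows)]
    return [[sigma2 * sum(A[i][m] * A[j][m] for m in range(cols))
             for j in range(rows)] for i in range(rows)]

R = pairwise_sum_correlation(5, sigma2=1.5)
for row in R:
    print(row)  # 2*sigma2 on the main diagonal, sigma2 next to it, zero elsewhere
```

Because a linear transformation of a Gaussian vector is Gaussian, this matrix together with the zero mean fully specifies the joint pdf of the block.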


Example 8.1-13 
(random walk sequence) Continuing with infinite-length Bernoulli trials, we now define a 
random sequence X[n] as the running sum of the number of successes (heads) minus the 
number of failures (tails) in n trials times a step size s, 


$$X[n] \triangleq \sum_{k=1}^{n} W[k] \qquad \text{with } X[0] = 0,$$

where we redefine W[k] = +s for outcome ζ = H and W[k] = −s for outcome ζ = T.
Figure 8.1-13 A sample sequence x[n] for random walk X[n] with step size s = 0.2. (Axes: sample sequence x[n] versus time index n.)


The resulting sequence then models a random walk on the integers starting at position 
X[0] = 0. At each succeeding time unit a step of size s is taken either to the right or to
the left. After n steps we will be at a position rs for some integer r. This is illustrated in 
Figure 8.1-13. 

If there are k successes and necessarily (n — k) failures, then we have the following 
relation: 


$$rs = ks - (n-k)s = (2k - n)s,$$

which implies that k = (n + r)/2, for those values of r that make the right-hand side an
integer. Then with P[success] = P[failure] = 1/2, we have

$$P\{X[n] = rs\} = P[(n+r)/2 \text{ successes}] = \begin{cases} \binom{n}{(n+r)/2} 2^{-n}, & (n+r)/2 \text{ an integer},\ |r| \le n, \\ 0, & \text{else.} \end{cases}$$


Using the fact that X[n] = W[1] + W[2] + ... + W[n] and that the W’s are jointly
independent, we can compute the mean and variance of the random walk as follows:

$$E\{X[n]\} = \sum_{k=1}^{n} E\{W[k]\} = \sum_{k=1}^{n} 0 = 0,$$

and

$$E\{X^2[n]\} = \sum_{k=1}^{n} E\{W^2[k]\} = n \cdot 0.5[(+s)^2 + (-s)^2] = ns^2.$$
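As a quick check on these formulas, the exact PMF of the random walk can be tabulated and its moments summed directly. The Python sketch below is illustrative only; the function name and the values n = 10, s = 0.2 are our own choices.

```python
import math

def random_walk_pmf(n):
    """Return {r: P[X[n] = r*s]} for the fair-coin random walk after n steps."""
    pmf = {}
    for k in range(n + 1):      # k = number of successes (heads)
        r = 2 * k - n           # position in units of the step size s
        pmf[r] = math.comb(n, k) / 2 ** n
    return pmf

n, s = 10, 0.2
pmf = random_walk_pmf(n)
total = sum(pmf.values())
mean = sum(r * s * p for r, p in pmf.items())
second_moment = sum((r * s) ** 2 * p for r, p in pmf.items())
print(total, mean, second_moment, n * s ** 2)
```

Only values of r with the same parity as n appear as keys, which reflects the condition that (n + r)/2 be an integer.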


If we normalize X[n] by dividing by $\sqrt{n}$ and define

$$\tilde{X}[n] \triangleq \frac{1}{\sqrt{n}} X[n],$$

then by the Central Limit Theorem 4.7-2 we have that the CDF of $\tilde{X}[n]$ converges to the
Gaussian (Normal) distribution $N(0, s^2)$. Thus for n large enough, we can approximate the
probabilities

$$P[a < \tilde{X}[n] \le b] = P[a\sqrt{n} < X[n] \le b\sqrt{n}] \approx \mathrm{erf}(b/s) - \mathrm{erf}(a/s).$$


Note, however, that when this probability is small, very large values of n might be required 
to keep the percentage error small because small errors in the CDF may be comparable to 
the required probability value. In practice this means that the Normal approximation will 
not be dependable on the tails of the distribution but only in the central part, hence the 
name Central Limit Theorem. 

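Note also that while X[n] can never be considered approximately Gaussian for any n
(e.g., if n is even, X[n] can only be an even multiple of s), still we can approximately
calculate the probability that X[n] = rs, for r of the same parity as n. Since the possible
values of X[n] are spaced 2s apart, this is

$$P[X[n] = rs] = P[(r-2)s < X[n] \le rs]$$
$$= P\left[ \frac{r-2}{\sqrt{n}} < \frac{X[n]}{s\sqrt{n}} \le \frac{r}{\sqrt{n}} \right]$$
$$\approx \frac{1}{\sqrt{2\pi}} \int_{(r-2)/\sqrt{n}}^{r/\sqrt{n}} \exp(-0.5u^2)\,du$$
$$\approx \frac{1}{\sqrt{\pi(n/2)}} \exp(-r^2/2n),$$

where r is small with respect to $\sqrt{n}$. See Section 1.11 for a similar result. In obtaining
the last line, we assumed that the integrand was approximately constant over the interval
$[(r-2)/\sqrt{n},\ r/\sqrt{n}]$.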
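To see how good this approximation is, the exact binomial probability can be compared with the Gaussian-based expression. The Python sketch below is illustrative; the function names and the choice n = 100 are our own assumptions.

```python
import math

def exact_prob(n, r):
    """Exact P[X[n] = r*s]; zero unless (n + r)/2 is an integer between 0 and n."""
    if abs(r) > n or (n + r) % 2 != 0:
        return 0.0
    return math.comb(n, (n + r) // 2) / 2 ** n

def approx_prob(n, r):
    """Approximation exp(-r^2 / (2n)) / sqrt(pi * n / 2) from the text."""
    return math.exp(-r * r / (2 * n)) / math.sqrt(math.pi * n / 2)

n = 100
for r in (0, 2, 4, 10):
    print(r, exact_prob(n, r), approx_prob(n, r))
```

The two columns agree closely for r small relative to the square root of n. The approximation assigns a positive value to every r, so it should only be applied to values of r with the same parity as n.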


The waiting-time sequence in Example 8.1-11 and the random walk in Example 8.1-13 
both have the property that they build up over time from independent components or 
increments. More generally we can define an independent-increments property. 
Definition 8.1-4 A random sequence is said to have independent increments if for
all integer parameters $n_1 < n_2 < \ldots < n_N$, the increments $X[n_1], X[n_2] - X[n_1], X[n_3] - X[n_2], \ldots, X[n_N] - X[n_{N-1}]$
are jointly independent for all integers N > 1.


If a random sequence has independent increments, one can build up its Nth-order 
probabilities (PMFs and pdf’s) as products of the probabilities of its increments. (See 
Problem 8.10.) 

In contrast to the evolving nature of independent increments, many random sequences 
have constant statistical properties that are invariant with respect to the index parameter n, 
normally time or distance. When this is valid, the random model is simplified in two ways: 
First, it is time-invariant, and second, the usually small number of model parameters can 
be estimated from available data. 


Definition 8.1-5 If for all orders N and for all shift parameters k, the joint CDFs of
(X[n], X[n+1], ..., X[n+N−1]) and (X[n+k], X[n+k+1], ..., X[n+k+N−1]) are
the same functions, then the random sequence is said to be stationary, i.e., for all N ≥ 1,

$$F_X(x_n, x_{n+1}, \ldots, x_{n+N-1}; n, n+1, \ldots, n+N-1) = F_X(x_n, x_{n+1}, \ldots, x_{n+N-1}; n+k, n+1+k, \ldots, n+N-1+k)$$ (8.1-16)

for all $-\infty < k < +\infty$ and for all $x_n$ through $x_{n+N-1}$. This definition also holds for pdf’s
when they exist and PMFs in the discrete amplitude case.


If we look back at Example 8.1-12, we see that X[n] and W[n] are both stationary 
random sequences. The same was true of the interarrival times τ[n] in Example 8.1-11, but
the random arrival or waiting time sequence T[n] was clearly nonstationary, since its mean
and variance increase with time n. 

Note that stationarity does not mean that the sample sequences all look “similar,” or 
even that they all look “noisy.”† Also, unlike the concept of stationarity in mathematics and
physics, we don’t directly characterize the realizations of the random sequence as stationary, 
just the deterministic functions that characterize their behavior, i.e., CDF, PMF, and pdf. 

It is often desirable to partially characterize a random sequence based on knowledge 
of only its first two moments, that is, its mean function and covariance function. This 
has already been encountered for random vectors in Chapter 5. We will encounter this for 
random sequences when we present a discussion of linear estimation in the signal-processing 
applications of Chapter 11. In anticipation we define a weakened kind of stationarity that 
involves only the mean and covariance (or correlation) functions. Specifically, if these two 
functions are consistent with stationarity, then we say that the random sequence is wide- 
sense stationary (WSS). 


†For example, suppose we do the Bernoulli experiment of flipping a fair coin once and generate a random
sequence as follows: If the outcome is heads then X[n] = 1 for all n. If the outcome is tails then X [n] = W[n], 
that is, stationary white noise again for all n. Thus, the sample sequences look quite dissimilar, but the 
random sequence is easily seen to be stationary. In Chapter 10, we discuss the property of ergodicity, which, 
loosely speaking, enables expectations (ensemble averages) to be computed from time averages. In this case 
the sample functions would tend to have the same features; that is, a viewer would subjectively feel that 
they come from the same source. 
Definition 8.1-6 A random sequence X[n] defined for $-\infty < n < +\infty$ is called
wide-sense stationary (WSS) if

(1) The mean function of X[n] is constant for all integers n, $-\infty < n < +\infty$,

$$\mu_X[n] = \mu_X[0],$$ and

(2) For all times k, l, $-\infty < k, l < +\infty$, and integers n, $-\infty < n < +\infty$, the covariance
(correlation) function is independent of the shift n,

$$K_{XX}[k, l] = K_{XX}[k+n, l+n].$$ (8.1-17)


We will call such a covariance (correlation) function shift-invariant. If we think of [k, l]
as a constellation or set of two samples on the time line, then we are translating this 
constellation up and down the time line, and saying that the covariance function does not 
change. When the mean function is constant, then shift invariance of the covariance and 
correlation functions is equivalent. Otherwise it is not. For a constant mean function, we 
can check property (2) for either the covariance or correlation function. 

While all stationary sequences are WSS, the reverse is not true. For example, the third 
moment could be shift-variant in a manner not consistent with stationarity even though 
the first moment is constant and the second moment is shift-invariant. Then the random 
sequence would be WSS but not stationary. To further distinguish them, sometimes we refer 
to stationarity as strict-sense stationarity to avoid confusion with the weaker concept of 
wide-sense stationarity. 


Theorem 8.1-2 All stationary random sequences are WSS. 


Proof We first show that the mean is constant for a stationary random sequence.
Let n be arbitrary; then

$$\mu_X[n] = E\{X[n]\} = \int_{-\infty}^{+\infty} x f_X(x; n)\,dx = \int_{-\infty}^{+\infty} x f_X(x; 0)\,dx = \mu_X[0],$$

since $f_X(x; n)$ does not depend on n. Next we show that the covariance function is
shift-invariant by first showing that the correlation is shift-invariant:


$$R_{XX}[k, l] = E\{X[k]X^*[l]\}$$
$$= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x_k x_l^* f_X(x_k, x_l)\,dx_k\,dx_l$$
$$= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x_{n+k} x_{n+l}^* f_X(x_{n+k}, x_{n+l})\,dx_{n+k}\,dx_{n+l} \quad \dagger$$
$$= R_{XX}[n+k, n+l],$$

†These middle two lines use our simplified notation. They are not trivially equal because $f_X(x_k, x_l)$ and
$f_X(x_{k+n}, x_{l+n})$ are really the joint densities at two different pairs of times. This can be made clear using
the full notation: $f_X(x_k, x_l; k, l)$.

since $f_X(x_k, x_l)$ doesn’t depend on the shift n, and the $x_i$’s are dummy variables. Finally,
we use Equation 8.1-13 and the result on the mean functions to conclude that the covariance 
function is also shift-invariant. Since the covariance function is shift-invariant for any WSS 
random sequence, we can define a one-parameter covariance function to simplify the notation 
for WSS sequences 


$$K_{XX}[m] \triangleq E\{X_c[k+m] X_c^*[k]\} = K_{XX}[k+m, k] = K_{XX}[m, 0].$$ (8.1-18)

We also do the same for correlation functions. Writing the one-parameter correlation
function in terms of the corresponding two-parameter correlation function, we have

$$R_{XX}[m] = R_{XX}[k+m, k] = R_{XX}[m, 0].$$


Example 8.1-14 
(WSS covariance function) The covariance function of Example 8.1-12 is shift-invariant
and so we can take advantage of the simplified notation. We can thus write
$K_{XX}[m] = \sigma^2 (2\delta[m] + \delta[m-1] + \delta[m+1])$.


Example 8.1-15 
(two-state random sequence with memory) We construct the two-level (binary) random
sequence X[n] on n ≥ 0 as follows. Recursively, and for each n (> 0) in succession, and for
each level, we set X[n] = X[n−1] with probability p, for some given 0 < p < 1. Otherwise,
and with probability $q \triangleq 1 - p$, we set X[n] to the “other” value (level). Let the two levels
be denoted a and b, and start off the sequence with X[0] = a. When p = 0.5, this is a
special case of the Bernoulli random sequence. When p ≠ 0.5, this is not an independent
random sequence, since $P_X(x_n | x_{n-1}; n, n-1) \neq P_X(x_n; n)$. We say the random sequence
has memory. To see this, consider the case where p ≈ 1.0; then set $x_n$ to the level other than
$x_{n-1}$, and note that the conditional transition probability $P_X(x_n | x_{n-1}; n, n-1) \approx 0$, while
the unconditional probability $P_X(x_n; n)$ is not so constrained. In fact, $P_X(x_n; n)$ would not
be expected to favor either level, since the above transition rules are the same for either
level. Intuitively, at least, it makes sense to call X[n−1] the state at time n−1. In fact, the
rules for generating this random sequence can be summarized in the state-transition diagram
shown in Figure 8.1-14, where the directed branches are labeled by the relevant probabilities
for the next state, given the present state, as can easily be verified by inspection. We can
refer to p as the no-transition probability. This is a first example of a Markov random
sequence, which will be studied in Section 8.5.

The following MATLAB m-file can generate sample functions for these random sequences
on n ≥ 1:

function [w] = randmemseq(p,N,w0,a,b)
% Two-level sequence with memory: keep the previous level with probability p.
w = a*ones(1,N);
w(1) = w0;                 % initial level
for i = 2:N
    rnum = rand;           % uniform random number on (0,1)
    if rnum < p
        w(i) = w(i-1);     % no transition
    else
        if w(i-1) == a     % transition to the other level
            w(i) = b;
        else
            w(i) = a;
        end
    end
end
stem(1:N,w)
title('random sequence with memory')
xlabel('discrete time')
ylabel('level')
end

Figure 8.1-14 State-transition diagram of two-state (binary) random sequence with memory.


Sample waveforms are given in Figures 8.1-15 to 8.1-17 corresponding to level values 
b = 1, a = 0, and several values of p. We note that when p is near 1, there are few transitions.
For p near 0.5, there will be many transitions displaying little memory. When p = 0, there 
is a transition every time. 
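For readers without MATLAB, the same generator can be sketched in Python. This is a hypothetical port, not the authors' code: the name `rand_mem_seq` and the optional `seed` argument are our own additions, and plotting is omitted.

```python
import random

def rand_mem_seq(p, num, w0, a, b, seed=None):
    """Generate num samples of the two-level sequence with memory.

    p is the no-transition probability, w0 the initial level, and a, b the
    two levels. seed (an added convenience) makes a run reproducible.
    """
    rng = random.Random(seed)
    w = [w0]
    for _ in range(num - 1):
        prev = w[-1]
        if rng.random() < p:
            w.append(prev)                    # stay at the same level
        else:
            w.append(b if prev == a else a)   # switch to the other level
    return w

print(rand_mem_seq(0.8, 20, 1, 0, 1, seed=1))
```

The two extreme cases are deterministic: p = 1 never leaves the initial level, and p = 0 alternates on every step.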


Figure 8.1-15 Initial level X[1] = 1, no-transition probability p = 0.8. (Axes: level versus time index n.)
Figure 8.1-16 Initial level X[1] = 1, no-transition probability p = 0.5, the Bernoulli case.
Figure 8.1-17 Initial value X[1] = 1, no-transition probability p = 0.

Example 8.1-16
(correlation function of random sequence with memory) Assume that the random sequence
with memory of the last example has been running for a very long time. Later on we will
show that in this case, a steady state develops wherein the probabilities of the two levels
are constant with time and independent of the starting state (level). Here we assume that
the steady state holds for all finite time. Clearly from the symmetry shown in the state
diagram, it must be that $P_X(a) = P_X(b) = 0.5$. Now assume that the lower level a = 0
and the upper level is b as before, and consider the correlation at two distinct times n and
n + k. We can write

$$R_{XX}[n, n+k] = b^2 P_X(b, b; n, n+k)$$
$$= b^2 P(X[n] = b) P(X[n+k] = b \mid X[n] = b)$$
$$= (b^2/2) P(X[n+k] = b \mid X[n] = b),$$

where the first equality holds since all the terms involving a are zero since a = 0. Now the
only way that X can equal b at both times n and n + k is for an even number of transitions
to occur between these two times, and the probability of this is given by


$$P\{\text{even number of transitions}\} = \sum_{l=0,2,4,\ldots}^{k} \binom{k}{l} (1-p)^l p^{k-l} \triangleq A_e,$$

which follows from the fact that this is just Bernoulli trials with “success” = “transition”
and “failure” = “no transition.” Thus, interchanging the usual roles of p and q in Bernoulli
trials, we just add up the probabilities of an even number of successes (transitions). It turns
out that $A_e$ can be evaluated in closed form by the following “trick.” Define

$$A_o \triangleq \sum_{l=1,3,5,\ldots}^{k} \binom{k}{l} (1-p)^l p^{k-l}.$$

Clearly, we have $A_e + A_o = 1$. Since l is always odd valued in the sum $A_o$, we have
$(1-p)^l = -(p-1)^l$ there, so that

$$A_o = -\sum_{l=1,3,5,\ldots}^{k} \binom{k}{l} (p-1)^l p^{k-l}.$$

Similarly we note that

$$A_e = \sum_{l=0,2,4,\ldots}^{k} \binom{k}{l} (1-p)^l p^{k-l} = \sum_{l=0,2,4,\ldots}^{k} \binom{k}{l} (p-1)^l p^{k-l},$$

where the second equality holds because l is always even in $A_e$. We now can see that

$$A_e - A_o = \sum_{l=0}^{k} \binom{k}{l} (p-1)^l p^{k-l} = (2p-1)^k$$

by the Binomial Theorem. It follows at once that $A_e = (1/2)[(2p-1)^k + 1]$, so that

$$R_{XX}[n, n+k] = (b^2/4)[(2p-1)^k + 1],$$


which shows that X[n] is WSS. We can write this correlation function more cleanly for the
case p > 1/2. On defining $\alpha \triangleq |\ln(2p - 1)|$, we have

$$R_{XX}[k] = (b^2/4)[\exp(-\alpha|k|) + 1].$$

Also since the mean value of X[n] is easily seen to be b/2, we get the autocovariance function

$$K_{XX}[k] = (b^2/4)\exp(-\alpha|k|).$$
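The closed form for the probability of an even number of transitions is easy to verify by brute force. The Python sketch below is illustrative; the function names are hypothetical. It sums the even binomial terms directly, compares them with (1/2)[(2p − 1)^k + 1], and evaluates the autocovariance as (b²/4)(2p − 1)^|k|, which equals (b²/4)exp(−α|k|) when p > 1/2.

```python
import math

def prob_even_transitions(k, p):
    """Direct sum over even l of C(k, l) (1-p)^l p^(k-l)."""
    q = 1.0 - p
    return sum(math.comb(k, l) * q ** l * p ** (k - l) for l in range(0, k + 1, 2))

def prob_even_closed_form(k, p):
    """Closed form (1/2) [(2p - 1)^k + 1] obtained from the binomial theorem."""
    return 0.5 * ((2 * p - 1) ** k + 1)

def autocovariance(k, p, b):
    """K_XX[k] = (b^2 / 4) (2p - 1)^|k| for levels a = 0 and b."""
    return (b * b / 4) * (2 * p - 1) ** abs(k)

for p in (0.8, 0.5, 0.2):
    print(p, [round(autocovariance(k, p, 2.0), 4) for k in range(6)])
```

For p < 1/2 the factor (2p − 1) is negative, so the covariance alternates in sign with the lag, which the exponential form (stated for p > 1/2) does not cover.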
A MATLAB m-file for displaying the covariance functions of these sequences, for three values
of p, is shown below:

function [mc1,mc2,mc3] = markov(b,p1,p2,p3,N)
% Covariance (b^2/4)*(2p-1)^k at lags k = 0,...,N-1 for three values of p.
mc1 = 0*ones(1,N);
mc2 = 0*ones(1,N);
mc3 = 0*ones(1,N);
for i = 1:N
    mc1(i) = 0.25*(b^2)*((2*p1-1)^(i-1));
    mc2(i) = 0.25*(b^2)*((2*p2-1)^(i-1));
    mc3(i) = 0.25*(b^2)*((2*p3-1)^(i-1));
end
x = linspace(0,N-1,N);
plot(x,mc1,x,mc2,x,mc3)
title('covariance of Markov Sequences')
xlabel('Lag interval')
ylabel('covariance value')


The normalized covariances for p = 0.8, 0.5, and 0.2 and b = 2 are shown in Figure 8.1-18. 


Figure 8.1-18 The covariance functions for different values of the parameter p. (Points connected by
straight lines. Axes: covariance value versus lag interval.)
This type of random sequence, which exhibits a one-step memory, is called a Markov
random sequence (there are variations on the spelling of Markov) in honor of the
mathematician A. A. Markov (1856-1922). In Section 8.5 we shall discuss this class of random sequences
in greater detail. In the meanwhile we note that the system discussed in Example 8.1-5,
that is, X[n] = aX[n − 1] + W[n], also exhibited a one-step memory and, hence, could also
be regarded as a Markov sequence, when W[n] is an independent random sequence.


In Section 8.2, we provide a review or summary of the theory of linear systems for 
sequences, that is, discrete-time linear system theory. Readers with adequate background 
may skip this section. In Section 8.3, we will apply this theory to study the effect of 
linear systems on random sequences, an area rich in applications in communications, signal 
processing, and control systems. 


8.2 BASIC PRINCIPLES OF DISCRETE-TIME LINEAR SYSTEMS 


In this section we present some fundamental material on discrete-time linear system theory. 
This will then be extended in the next section to the case of random sequence inputs 
and outputs. This material is very similar to the continuous-time linear system theory 
including the topics of differential equations, Fourier transforms, and Laplace transforms. 
The corresponding quantities in the discrete-time theory are difference equations, Fourier 
transforms (for discrete-time signals), and Z-transforms. 

With reference to Figure 8.2-1 we see that a linear system can be thought of as having
an infinite-length sequence x[n] as input with a corresponding infinite-length sequence y[n]
as output. Representing this linear operation in equation form we have

$$y[n] = L\{x[n]\},$$ (8.2-1)

where the linear operator L is defined to satisfy the following definition adapted to the
case of discrete-time signals. This notation might appear to indicate that x[n] at time n
is the only input value that affects the output y[n] at time n. In fact, all input values
can potentially affect the output at any time n. This is why we call L an operator† and
not merely a function. The examples below will make this point clear. Mathematicians
use the operator notation y = L{x}, which avoids this difficulty but makes the
functional dependence of x and y on the (time) parameter n less clear than in our engineering
notation.

x[n] ---> L{·} ---> y[n]

Figure 8.2-1 System diagram for generic linear system L{·} with input x[n] and output y[n] and time
index parameter n.

†Operators map functions (sequences) into functions (sequences).


Definition 8.2-1 We say a system with operator L is linear if for all permissible input
sequences $x_1[n]$ and $x_2[n]$, and for all permissible pairs of scalar gains $a_1$ and $a_2$, we have

$$L\{a_1 x_1[n] + a_2 x_2[n]\} = a_1 L\{x_1[n]\} + a_2 L\{x_2[n]\}.$$


In words, the response of a linear system to a weighted sum of inputs is the weighted sum 
of the individual outputs. Examples of linear systems would include moving averages such as 


$$y[n] = 0.33(x[n+1] + x[n] + x[n-1]), \qquad -\infty < n < +\infty,$$

and autoregressions such as

$$y[n] = a\,y[n-1] + b\,y[n-2] + c\,x[n], \qquad 0 \le n < +\infty,$$

when the initial conditions are zero. Both these equations are special cases of the more
general linear constant-coefficient difference equation (LCCDE),

$$y[n] = \sum_{k=1}^{M} a_k y[n-k] + \sum_{k=0}^{N} b_k x[n-k].$$ (8.2-2)


Example 8.2-1 
(solution of difference equations) Consider the following second-order LCCDE, 


$$y[n] = 1.7y[n-1] - 0.72y[n-2] + u[n],$$ (8.2-3)

with y[−1] = y[−2] = 0 and u[n] the unit-step function. To solve this equation for n ≥ 0,
we first find the general solution to the homogeneous equation

$$y_h[n] = 1.7y_h[n-1] - 0.72y_h[n-2].$$

We try $y_h[n] = Ar^n$, where A and r are to be determined,† and obtain

$$A(r^n - 1.7r^{n-1} + 0.72r^{n-2}) = 0$$

or

$$Ar^{n-2}(r^2 - 1.7r + 0.72) = 0.$$

†A thorough treatment of the solution of linear difference equations may be found in [8-5].


We thus see that any value of r satisfying the characteristic equation

$$r^2 - 1.7r + 0.72 = 0$$

will give a general solution to the homogeneous equation. In this case there are two roots
at $r_1 = 0.8$ and $r_2 = 0.9$.† By linear superposition the general homogeneous solution must
be of the form

$$y_h[n] = A_1 r_1^n + A_2 r_2^n,$$

where the constants $A_1$ and $A_2$ may be determined from the initial conditions.

To obtain the particular solution, we first observe that the input sequence u[n] equals 1
for n ≥ 0. Thus we try as a particular solution a constant, that is, following standard
practice,

$$y_p[n] = B \qquad \text{for } n \ge 0$$

and obtain

$$B - 1.7B + 0.72B = 1$$

or

$$B = 1/(1 - 1.7 + 0.72) = 1/(0.02) = 50.$$


More generally this method can be modified for any input function of the form $C\rho^n$
over adjoining time intervals $[n_1, n_2 - 1]$. One just assumes the corresponding form for
the solution and determines the constant C as shown. In this approach, we would solve the
difference equation for each time interval separately, piecing the solution together at the
boundaries by carrying across final conditions to become the initial conditions for the next
interval. We illustrate our approach here for the time interval starting at n = 0. The total
solution is

$$y[n] = y_h[n] + y_p[n] = A_1(0.8)^n + A_2(0.9)^n + 50 \qquad \text{for } n \ge 0.$$


To determine $A_1$ and $A_2$, we first evaluate Equation 8.2-3 at n = 0 and n = 1 using
y[−1] = y[−2] = 0 to carry across the initial conditions to obtain y[0] = 1 and y[1] = 2.7,
from which we obtain the linear equations

$$A_1 + A_2 + 50 = 1 \qquad (\text{at } n = 0)$$

and

$$A_1(0.8) + A_2(0.9) + 50 = 2.7 \qquad (\text{at } n = 1).$$

This can be put in matrix form

$$\begin{bmatrix} 1.0 & 1.0 \\ 0.8 & 0.9 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = \begin{bmatrix} -49.0 \\ -47.3 \end{bmatrix}$$

and solved to yield

$$\begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = \begin{bmatrix} 32 \\ -81 \end{bmatrix}.$$

Thus the complete solution, valid for n ≥ 0, is

$$y[n] = 32(0.8)^n - 81(0.9)^n + 50.$$

We could then write the solution for all time, if the system was at rest for n < 0, as

$$y[n] = \{32(0.8)^n - 81(0.9)^n + 50\}\,u[n].$$

†Since the two roots are less than one in magnitude, the solution will be stable when run forward in
time index n (cf. [8-5]).
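The closed-form solution can be checked against a direct iteration of the difference equation. The Python sketch below is illustrative only; the function names are our own. It runs Equation 8.2-3 forward from initial rest and compares each sample with 32(0.8)^n − 81(0.9)^n + 50.

```python
def solve_by_recursion(num):
    """Iterate y[n] = 1.7 y[n-1] - 0.72 y[n-2] + u[n] for n = 0, ..., num - 1."""
    y_prev1 = 0.0   # y[n-1], starting from y[-1] = 0
    y_prev2 = 0.0   # y[n-2], starting from y[-2] = 0
    out = []
    for _ in range(num):
        y = 1.7 * y_prev1 - 0.72 * y_prev2 + 1.0   # u[n] = 1 for n >= 0
        out.append(y)
        y_prev2, y_prev1 = y_prev1, y
    return out

def closed_form(n):
    """Complete solution from the example, valid for n >= 0."""
    return 32 * 0.8 ** n - 81 * 0.9 ** n + 50

for n, y in enumerate(solve_by_recursion(6)):
    print(n, y, closed_form(n))
```

Both columns start at y[0] = 1 and y[1] = 2.7 and approach the particular solution B = 50 as n grows, as expected for roots inside the unit circle.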


Note that the LCCDE in the previous example is a linear system because the initial 
conditions, that is, y[—1], y[—2], were zero, often called the initial rest condition. Without 
initial rest, an LCCDE is not a linear system. More generally, linear systems are described 
by superposition with a possibly time-variant impulse response 


$$h[n, k] \triangleq L\{\delta[n-k]\}.$$

In words we call h[n, k] the response at time n to an impulse applied at time k. We derive
the result by simply writing the input as $x[n] = \sum_{k} x[k]\delta[n-k]$, and then using linearity to
conclude

$$y[n] = L\left\{ \sum_{k=-\infty}^{+\infty} x[k]\delta[n-k] \right\}$$
$$= \sum_{k=-\infty}^{+\infty} x[k] L\{\delta[n-k]\}$$
$$= \sum_{k=-\infty}^{+\infty} x[k] h[n, k],$$

which is called the superposition summation representation for linear systems.

Many linear systems are made of constant components and have an effect on input 
signals that is invariant to when the signal arrives at the system. A linear system is called 
linear time-invariant (LTI) or, equivalently, linear shift-invariant (LSI) if the response to
a delayed (shifted) input is just the delayed (shifted) response. More precisely, we have the
following. 


Definition 8.2-2 A linear system L is called shift-invariant if for all integer shifts k,
−∞ < k < +∞, we have

$$y[n + k] = L\{x[n + k]\} \quad \text{for all } n. \tag{8.2-4}$$ ∎


An important property of LSI systems is that they are described by convolution,† that is,
L is a convolution operator,

$$y[n] = h[n] * x[n] = x[n] * h[n],$$

where

$$h[n] * x[n] \triangleq \sum_{k=-\infty}^{+\infty} h[k]\,x[n-k], \tag{8.2-5}$$

and the sequence

$$h[n] \triangleq L\{\delta[n]\},$$

is called the impulse response. With relation to the time-varying impulse response h[n, k],
we can see that h[n] = h[n, 0] when a linear system is shift-invariant.


In words we can say that—just as for continuous-time systems—if we know the impulse 
response of an LSI system, then we can compute the response to any other input by carrying 
out the convolution operation. In the discrete-time case this convolution operation is a 
summation rather than an integration, but the operation is otherwise the same. 
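
As a small illustration of the convolution sum in Equation 8.2-5, here is a minimal sketch for finite-length sequences that both start at n = 0. The function name `convolve` and the test sequences are hypothetical choices, not taken from the text.

```python
def convolve(h, x):
    """Convolution sum y[n] = sum_k h[k] x[n-k] for finite lists that start at n = 0."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k, hk in enumerate(h):
        for i, xi in enumerate(x):
            y[k + i] += hk * xi  # term h[k] x[i] contributes to output index n = k + i
    return y


h = [1.0, -1.0]            # backward difference: h[n] = delta[n] - delta[n-1]
x = [1.0, 2.0, 4.0, 4.0]   # arbitrary test input
y = convolve(h, x)
print(y)
```

Swapping the arguments gives the same output, which reflects the commutativity h[n] * x[n] = x[n] * h[n] stated above.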

While in principle we could determine the output to any input, given knowledge of the 
impulse response, in practice the calculation of the convolution operation may be tedious 
and time consuming. To facilitate such calculations and also to gain added insight, we turn 
to a frequency-domain characterization of LSI systems. We begin by defining the Fourier 
transform (FT) for sequences as follows. 


Definition 8.2-3 The Fourier transform for a discrete-time signal or sequence is
defined by the infinite sum (if it exists)

$$X(\omega) = FT\{x[n]\} \triangleq \sum_{n=-\infty}^{+\infty} x[n]\,e^{-j\omega n}, \quad \text{for } -\pi \le \omega \le +\pi,$$

and the function X(ω) is periodic with period 2π outside this range. The inverse Fourier
transform is given as

$$x[n] = IFT\{X(\omega)\} \triangleq \frac{1}{2\pi}\int_{-\pi}^{+\pi} X(\omega)\,e^{+j\omega n}\,d\omega.$$ ∎


†We encountered the operation of convolution in Chapter 3 when we computed the pdf of the sum of
two independent RVs.

One can see that the Fourier transform and its inverse for sequences are really just
the familiar Fourier series with the sequence x playing the role of the Fourier coefficients
and the Fourier transform X playing the role of the periodic function. Thus, the existence
and uniqueness theorems of Fourier series are immediately applicable here to the Fourier
transform for discrete-time signals. Note that the frequency variable ω is sometimes called
normalized frequency because, if the sequence x[n] arose from sampling, the period of such
sampling has been lost. It is as though the sample period were T = 1, as would be consistent
with the [−π, +π] frequency range of the Fourier transform X(ω).‡

For an LSI system the Fourier transform is particularly significant owing to the fact 
that complex exponentials are the eigenfunctions of discrete-time linear systems, that is, 


$$L\{e^{j\omega n}\} = H(\omega)\,e^{j\omega n}, \tag{8.2-6}$$


as long as the impulse response h is absolutely summable. For LSI systems this absolute 
summability can easily be seen to be equivalent to bounded-input bounded-output (BIBO) 
stability [8-5]. 

Just as in continuous-time system theory, multiplication of Fourier transforms cor- 
responds to convolution in the time (or space) domain. 


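Theorem 8.2-1 (convolution theorem) The convolution,

$$y[n] = x[n] * h[n], \quad -\infty < n < +\infty,$$

is equivalent in the transform domain to

$$Y(\omega) = X(\omega)H(\omega), \quad -\pi \le \omega \le +\pi.$$

Proof

$$\begin{aligned}
Y(\omega) &= \sum_{n=-\infty}^{+\infty} y[n]\,e^{-j\omega n} = \sum_{n=-\infty}^{+\infty} (x[n] * h[n])\,e^{-j\omega n} \\
&= \sum_{n}\sum_{k} x[k]\,h[n-k]\,e^{-j\omega n} = \sum_{n}\sum_{k} x[k]\,h[n-k]\,e^{-j\omega(n-k+k)} \\
&= \sum_{n}\sum_{k} x[k]\,e^{-j\omega k}\,h[n-k]\,e^{-j\omega(n-k)} \\
&= \sum_{k} x[k]\,e^{-j\omega k}\,H(\omega) \\
&= X(\omega)H(\omega).
\end{aligned}$$ ∎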
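
The convolution theorem can be spot-checked numerically for short sequences. The following is a minimal sketch, assuming finite-length sequences that start at n = 0; the helper names `dtft` and `convolve` and the sample values are hypothetical.

```python
import cmath


def convolve(h, x):
    """y[n] = sum_k h[k] x[n-k] for finite lists that start at n = 0."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k, hk in enumerate(h):
        for i, xi in enumerate(x):
            y[k + i] += hk * xi
    return y


def dtft(seq, w):
    """X(w) = sum_n x[n] exp(-j w n), evaluated at a single frequency w."""
    return sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(seq))


x = [1.0, 2.0, 0.5]
h = [0.5, -0.25, 0.125, 1.0]
y = convolve(h, x)

# Largest mismatch between Y(w) and X(w)H(w) over a few test frequencies.
max_gap = max(abs(dtft(y, w) - dtft(x, w) * dtft(h, w))
              for w in (-3.0, -1.0, 0.0, 0.9, 2.5))
print(max_gap)
```

The mismatch should be at the level of floating-point rounding, since for finite sequences the identity is exact.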


‡If the sequence arose from sampling with sample period T, the (true) radian frequency Ω = ω/T.

Thus, discrete-time linear shift-invariant systems are easily understood in the frequency
domain similar to the situation for continuous-time LSI systems. Analogous to the Laplace 
transform for continuous-time signals, there is the Z-transform for discrete-time signals. It 
is defined as follows. 


Definition 8.2-4 The Z-transform of a discrete-time signal or sequence is defined as
the infinite summation (if it exists)

$$\mathsf{X}(z) \triangleq \sum_{n=-\infty}^{+\infty} x[n]\,z^{-n}, \tag{8.2-7}$$

where z is a complex variable in the region of absolute convergence of this infinite sum.* ∎


Note that $\mathsf{X}(z)$ is a function of a complex variable, while X(ω) is a function of a real
variable. The two are related by $\mathsf{X}(z)|_{z=e^{j\omega}} = X(\omega)$. We thus see that, if the Z-transform
exists, the Fourier transform is just the restriction of the Z-transform to the unit circle in
the complex z-plane. Similarly to the proof of Theorem 8.2-1, it is easy to show that the
convolution-multiplication property Equation 8.2-1 is also true for Z-transforms. Analogous
to continuous-time theory, the Z-transform $\mathsf{H}(z)$ of the impulse response h[n] of an LSI
system is called the system function. For more information on discrete-time signals and
systems, the reader is referred to [8-5].


8.3 RANDOM SEQUENCES AND LINEAR SYSTEMS 


In this section we look at the topic of linear systems with random sequence inputs. In 
particular we will look at how the mean and covariance functions are transformed by both 
linear and LSI systems. We will do this first for the general case of a nonstationary random 
sequence and then specialize to the more common case of a stationary sequence. The topics of 
this section are perhaps the most widely used concepts from the theory of random sequences. 
Applications arise in communications when analyzing signals and noise in linear filters, in 
digital signal processing for the analysis of quantization noise in digital filters, and in control 
theory to find the effect of disturbance inputs on an otherwise deterministic control system. 

The first issue is the meaning of inputting a random sequence to a linear system. The
problem is that a random sequence is not just one sequence but a whole family of sequences
indexed by the parameter ζ, a point (outcome) in the sample space. As such, for each fixed
ζ, the random sequence is just an ordinary sequence that may be a permissible input for
the linear system. Thus, when we talk about a linear system with a random sequence input,
it is natural to say that for each point in the sample space Ω, we input the corresponding
realization, that is, the sample sequence x[n]. We would therefore regard the corresponding
output y[n] as a sample sequence† corresponding to the same point ζ in the sample space,
thus collectively defining the output random sequence Y[n].


*Note the sans serif font to distinguish between the Z-transform and the Fourier transform. 
†Recall that x[n], y[n] denote X[n, ζ], Y[n, ζ], respectively, for fixed ζ.

Definition 8.3-1 When we write Y[n] = L{X[n]} for a random sequence X[n] and
a linear system L, we mean that for each ζ ∈ Ω we have

$$Y[n, \zeta] = L\{X[n, \zeta]\}.$$

Equivalently, for each sample function x[n] taken on by the input random sequence X[n],
we set y[n] as the corresponding sample sequence of the output random sequence Y[n], that
is, y[n] = L{x[n]}. ∎


This is the simplest way to treat systems with random inputs. A difficulty arises when 
the input sample sequences do not “behave well,” in which case it may not be possible to 
define the system operation for every one of them. In Chapter 10 we will generalize this 
definition and discuss a so-called mean-square description of the system operation, which 
avoids such problems, although of necessity it will be more abstract. 

In most cases it is very hard to find the probability distribution of the output from 
the probabilistic description of the input to a linear system. The reason is that since the 
impulse response is often very long (or infinitely long), high-order distributions of the input 
sequence would be required to determine the output CDF. In other words, if Y[n] depends 
on the most recent k input values X[n], ..., X[n − k + 1], then the kth-order pdf of X
has to be known in order to compute even the first-order pdf of Y. The situation with 
moment functions is different. The moments of the output random sequence can be calcu- 
lated from equal- or lower-order moments of the input, when the system is linear. Partly for 
this reason, it is of considerable interest to determine the output moment functions in terms 
of the input moment functions. In the practical and important case of the Gaussian random 
sequence, we have seen that the entire probabilistic description depends only on the mean 
and covariance functions. In fact because the linear system is in effect performing a linear 
transformation on the infinite-dimensional vector that constitutes the input sequence, we 
can see that the output sequence will also obey the Gaussian law in its nth-order distribu- 
tions if the input sequence is Gaussian. Thus, the determination of the first- and second- 
order moment functions of the output is particularly important when the input sequence is 
Gaussian. 


Theorem 8.3-1 For a linear system L and a random sequence X[n], the mean of the
output random sequence Y[n] is

$$E\{Y[n]\} = L\{E\{X[n]\}\} \tag{8.3-1}$$


as long as both sides are well defined. 


Proof (formal). Since L is a linear operator, we can write

$$y[n] = \sum_{k=-\infty}^{+\infty} h[n,k]\,x[k]$$

for each sample sequence input-output pair, or

$$Y[n, \zeta] = \sum_{k=-\infty}^{+\infty} h[n,k]\,X[k, \zeta],$$

where we explicitly indicate the outcome ζ. If we operate on both sides with the expectation
operator E, we get

$$E\{Y[n]\} = E\left\{\sum_{k=-\infty}^{+\infty} h[n,k]\,X[k]\right\}.$$

Now, assuming it is valid to bring the operator E inside the infinite sum, we get

$$E\{Y[n]\} = \sum_{k=-\infty}^{+\infty} h[n,k]\,E\{X[k]\} = L\{E\{X[n]\}\},$$

which can be written as

$$\mu_Y[n] = \sum_{k=-\infty}^{+\infty} h[n,k]\,\mu_X[k],$$

that is, the mean function of the output is the response of the linear system to the mean
function of the input. ∎


Some comments are necessary with regard to this interchange of the expectation and 
linear operator. It cannot always be done! For example, if the input has a nonzero mean 
function and the linear system is a running sum, that is, 

$$y[n] = \sum_{k=0}^{+\infty} x[n-k],$$
the running sum of the mean may not converge. Then such an interchange is not valid. We 
will come back to this point when we study stochastic convergence in Section 8.7. We will 
see then that a sufficient condition for an LSI system to satisfy Equation 8.3-1 is that its 
impulse response h[n] be absolutely summable. 

There are special cases of Equation 8.3-1 depending on whether the input sequence is 
WSS and whether the system is LSI. If the system is LSI and the input is at least WSS, 
then the mean of the output is given as 


$$E\{Y[n]\} = \sum_{k=-\infty}^{+\infty} h[n-k]\,\mu_X.$$

Now because $\mu_X$ is a constant, we can take it out of the sum and obtain

$$E\{Y[n]\} = \left[\sum_{k=-\infty}^{+\infty} h[k]\right]\mu_X \tag{8.3-2}$$

$$= \mathsf{H}(z)|_{z=1}\,\mu_X, \tag{8.3-3}$$

at least whenever $\sum_{k=-\infty}^{+\infty} |h[k]|$ exists, that is, for any BIBO stable system (cf. Section 8.2).
Thus, we observe that in this case the mean of the output random sequence is a constant
equal to the product of the dc gain or constant gain of the LSI system times the mean of 
the input sequence. 


Example 8.3-1 
(lowpass filter) Let the system be a lowpass filter with system function 


$$\mathsf{H}(z) = 1/(1 + a z^{-1}),$$

where we require |a| < 1 for stability of this assumed causal filter (i.e., the region of
convergence is |z| > |a|, which includes the unit circle). Then if a WSS sequence is the
input to this filter, the mean of the output will be

$$E\{Y[n]\} = \mathsf{H}(z)|_{z=1}\,E\{X[n]\} = (1 + a)^{-1}\mu_X.$$
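
The dc-gain result can be illustrated by passing the (constant) input mean through the filter, since by Theorem 8.3-1 the output mean is the system response to the input mean. This is a sketch under stated assumptions: the recursion y[n] = −a y[n−1] + x[n] is the causal realization of 1/(1 + a z^{-1}), the mean is applied from n = 0 with initial rest, and the values a = −0.5 and μ_X = 3 are arbitrary illustrative choices.

```python
def filter_output(a, x):
    """Causal recursion y[n] = -a*y[n-1] + x[n], i.e. H(z) = 1/(1 + a z^-1), initial rest."""
    y_prev, out = 0.0, []
    for xn in x:
        y_prev = -a * y_prev + xn
        out.append(y_prev)
    return out


a, mu_x = -0.5, 3.0              # illustrative values; |a| < 1 is required for stability
mean_in = [mu_x] * 200           # constant input mean applied from n = 0
mean_out = filter_output(a, mean_in)
dc_gain = 1.0 / (1.0 + a)        # H(z) evaluated at z = 1
print(mean_out[-1], dc_gain * mu_x)
```

After the start-up transient dies away, the output mean settles at the dc gain times the input mean.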


We now turn to the problem of calculating the output covariance and correlation of the 
general linear system whose operator is L: 


$$Y[n] = L\{X[n]\}.$$


We will find it convenient to introduce a cross-correlation function between the input 
and output, 


$$R_{XY}[m,n] \triangleq E\{X[m]\,Y^*[n]\} \tag{8.3-4}$$

$$= E\{X[m]\,(L\{X[n]\})^*\}. \tag{8.3-5}$$

Now, in order to factor out the operator, we introduce the operator $L_n^*$, with impulse
response h*[n, k], which operates on time index n but treats time index m as a constant.
We can then write $R_{XY}[m,n] = E\{X[m]\,L_n^*\{X^*[n]\}\} = L_n^*\,E\{X[m]X^*[n]\}$. Similarly we
denote with $L_m$ the linear operator with time index m, that treats n as a constant. The
operator $L_n^*$ is related to the adjoint operator studied in linear algebra.


Theorem 8.3-2 Let X[n] and Y[n] be two random sequences that are the input
and output, respectively, of the linear operator $L_n$. Let the input correlation function be
$R_{XX}[m,n]$. Then the cross- and output-correlation functions are, respectively, given by

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\}$$

and

$$R_{YY}[m,n] = L_m\{R_{XY}[m,n]\}.$$ ∎

Proof Write

$$X[m]\,Y^*[n] = X[m]\,L_n^*\{X^*[n]\} = L_n^*\{X[m]X^*[n]\}.$$

Then

$$\begin{aligned}
R_{XY}[m,n] = E\{X[m]\,Y^*[n]\} &= E\{L_n^*\{X[m]X^*[n]\}\} \\
&= L_n^*\{E\{X[m]X^*[n]\}\} \\
&= L_n^*\{R_{XX}[m,n]\},
\end{aligned}$$

thus establishing the first part of the theorem. To show the second part, we proceed
analogously by multiplying Y[m] by Y*[n] to get

$$\begin{aligned}
E\{Y[m]\,Y^*[n]\} &= E\{L_m\{X[m]\,Y^*[n]\}\} \\
&= L_m\{E\{X[m]\,Y^*[n]\}\} \\
&= L_m\{R_{XY}[m,n]\},
\end{aligned}$$

as was to be shown. ∎
If we combine both parts of Theorem 8.3-2 we get an operator expression for the output 
correlation in terms of the input correlation function: 


$$R_{YY}[m,n] = L_m\{L_n^*\{R_{XX}[m,n]\}\}, \tag{8.3-6}$$

which can be put into the form of a superposition summation for a system with time-variant
impulse response h[n, k] as

$$R_{YY}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]\left(\sum_{l=-\infty}^{+\infty} h^*[n,l]\,R_{XX}[k,l]\right). \tag{8.3-7}$$

Here the superposition summation representation for $R_{XY}[m,n]$ is

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\} = \sum_{l=-\infty}^{+\infty} h^*[n,l]\,R_{XX}[m,l],$$

and that for $R_{YX}[m,n]$ is

$$R_{YX}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]\,R_{XX}[k,n].$$



To find the corresponding results for covariance functions, we note that the centered output
sequence is the output due to the centered input sequence, due to the linearity of the system
and Equation 8.3-1. Then applying Theorem 8.3-2 to these zero-mean sequences, we have
immediately that, for covariance functions,

$$K_{XY}[m,n] = L_n^*\{K_{XX}[m,n]\} \tag{8.3-8}$$

$$K_{YY}[m,n] = L_m\{K_{XY}[m,n]\} \tag{8.3-9}$$

and

$$K_{YY}[m,n] = L_m\{L_n^*\{K_{XX}[m,n]\}\}, \tag{8.3-10}$$

which becomes the following superposition summation

$$K_{YY}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]\left(\sum_{l=-\infty}^{+\infty} h^*[n,l]\,K_{XX}[k,l]\right). \tag{8.3-11}$$


Example 8.3-2 
(edge detector) Let $Y[n] \triangleq X[n] - X[n-1] = L\{X[n]\}$, an operator that represents a
first-order (backward) difference. See Figure 8.3-1. This linear operator could be applied
to locate an impulse noise spike in some random data. The output mean is
$E[Y[n]] = L\{E[X[n]]\} = \mu_X[n] - \mu_X[n-1]$. The cross-correlation function is

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\} = R_{XX}[m,n] - R_{XX}[m,n-1].$$

The output autocorrelation function is

$$\begin{aligned}
R_{YY}[m,n] &= L_m\{R_{XY}[m,n]\} \\
&= R_{XY}[m,n] - R_{XY}[m-1,n] \\
&= R_{XX}[m,n] - R_{XX}[m-1,n] - R_{XX}[m,n-1] + R_{XX}[m-1,n-1].
\end{aligned}$$

If the input random sequence were WSS with autocorrelation function

$$R_{XX}[m,n] = a^{|m-n|}, \quad 0 < a < 1,$$

then the above example would specialize to

$$R_{XY}[m,n] = a^{|m-n|} - a^{|m-n+1|}$$

and

$$R_{YY}[m,n] = 2a^{|m-n|} - a^{|m-1-n|} - a^{|m+1-n|},$$

which depends on only m − n. Hence the output random sequence is WSS and we can write
(with k = m − n)

$$R_{YY}[k] = 2a^{|k|} - a^{|k-1|} - a^{|k+1|}.$$

For the input autocorrelation with a = 0.7 as shown in Figure 8.3-2, the output
autocorrelation function is shown in Figure 8.3-3. Note that the edge detector has a strong tendency
to decorrelate the input sequence.

Figure 8.3-1 An edge detector that gives nearly zero output when X[n] ≈ X[n − 1] and a large output
when |X[n] − X[n − 1]| is large. (Block diagram: input X[n], a delay unit, and output Y[n].)

Figure 8.3-2 Input correlation function for edge detector with a = 0.7. (Plot of the input
correlation function versus shift parameter k.)
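
The edge-detector result can be tabulated directly. The sketch below evaluates the one-parameter output autocorrelation for a = 0.7 and cross-checks it against the four-term difference of input correlations derived above; the function names are hypothetical.

```python
def r_xx(k, a):
    """Input autocorrelation a^|k| of the WSS input, with k = m - n."""
    return a ** abs(k)


def r_yy_closed(k, a):
    """R_YY[k] = 2 a^|k| - a^|k-1| - a^|k+1|."""
    return 2 * a ** abs(k) - a ** abs(k - 1) - a ** abs(k + 1)


def r_yy_from_differences(m, n, a):
    """R_XX[m,n] - R_XX[m-1,n] - R_XX[m,n-1] + R_XX[m-1,n-1] with R_XX[m,n] = a^|m-n|."""
    return (r_xx(m - n, a) - r_xx(m - 1 - n, a)
            - r_xx(m - (n - 1), a) + r_xx((m - 1) - (n - 1), a))


a = 0.7
table = {k: r_yy_closed(k, a) for k in range(-3, 4)}
for k, value in table.items():
    print(k, round(value, 4))
```

For a = 0.7 the zero-lag value is 2 − 2a = 0.6 while the neighboring lags are small and negative, which is one way to see the decorrelating tendency noted in the example.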


Example 8.3-3 
(covariance functions of a recursive system) With |a| < 1, let

$$Y[n] = aY[n-1] + (1-a)W[n] \tag{8.3-12}$$

for n ≥ 0 subject to Y[−1] = 0. Since the initial condition is zero, the system is equivalently
LSI for n ≥ 0, so we can represent L by convolution, where

$$h[n] = (1-a)\,a^n\,u[n].$$


Figure 8.3-3 Correlation function $R_{YY}[k]$ for backward difference example (plot has a = 0.7).
(Plot of $R_{YY}[k]$ versus shift parameter k, for k from −10 to 10.)

Here h[n] is the impulse response of the corresponding deterministic first-order difference
equation, that is, h[n] is the solution to

$$h[n] = a\,h[n-1] + (1-a)\,\delta[n],$$

where δ[n] is the discrete-time impulse sequence. This solution can be obtained easily by
recursion or by using the Z-transform.† Then specializing Equation 8.3-1, we obtain

$$\mu_Y[n] = \sum_{k=0}^{\infty} (1-a)\,a^k\,\mu_W[n-k], \quad \text{where } \mu_W[n] = 0 \text{ for } n < 0.$$
k=0 


Applying Equations 8.3-8 and 8.3-9 to this case enables us to write, for a real,

$$K_{WY}[m,n] = \sum_{k=0}^{\infty} (1-a)\,a^k\,K_{WW}[m, n-k]$$

and

$$K_{YY}[m,n] = \sum_{l=0}^{\infty} (1-a)\,a^l\,K_{WY}[m-l, n],$$

which can be combined to yield

$$K_{YY}[m,n] = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} (1-a)^2\,a^k a^l\,K_{WW}[m-l, n-k].$$


†Taking the Z-transform of both sides of the above equation, and noting that the Z-transform of the
impulse sequence is 1, we obtain $\mathsf{H}(z) = (1 - a)/(1 - az^{-1})$. Upon applying the inverse Z-transform, one
gets the h[n] given above. (For help with the inverse Z-transform, see Appendix A.)

Now if the input sequence W[n] has covariance function

$$K_{WW}[m,n] = \sigma_W^2\,\delta[m-n] \quad \text{for } m, n \ge 0,$$

then the output covariance is calculated as

$$\begin{aligned}
K_{YY}[m,n] &= \sum_{k=0}^{n} (1-a)^2\,a^k\,a^{(m-n)+k}\,\sigma_W^2 \quad \text{for } m \ge n \ge 0, \\
&= a^{(m-n)}(1-a)^2\,\sigma_W^2 \sum_{k=0}^{n} a^{2k} \\
&= a^{(m-n)}\left[(1-a)^2/(1-a^2)\right]\sigma_W^2\,(1 - a^{2n+2}) \quad \text{for } m \ge n \ge 0 \\
&= \left[(1-a)/(1+a)\right] a^{|m-n|}\,\sigma_W^2\,(1 - a^{2\min(m,n)+2}) \quad \text{for all } m, n \ge 0,
\end{aligned}$$

where the last step follows from the required symmetry in (m, n). Note that the term
$a^{2\min(m,n)+2}$ is a transient that dies away as m, n → ∞, since |a| < 1, so that asymptotically
we have the steady-state answer

$$K_{YY}[m,n] = \frac{1-a}{1+a}\,\sigma_W^2\,a^{|m-n|}, \quad m, n \to \infty,$$

a shift-invariant covariance function. If the mean function $\mu_Y[n]$ is found to be asymptotic
to a constant, then the random sequence Y[n] is said to be asymptotically WSS. We discuss
WSS random sequences in greater detail in the next section.
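
The closed-form covariance can be cross-checked against the double superposition sum. This is a minimal sketch assuming white input covariance σ_W² δ[m − n] on m, n ≥ 0 (so the sums truncate to finite ranges); the function names and the values a = 0.6, σ_W² = 2 are illustrative assumptions.

```python
def k_yy_double_sum(m, n, a, var_w):
    """sum_k sum_l (1-a)^2 a^k a^l K_WW[m-l, n-k], with K_WW[i,j] = var_w*delta[i-j], i,j >= 0."""
    total = 0.0
    for k in range(n + 1):           # keeps n - k >= 0
        for l in range(m + 1):       # keeps m - l >= 0
            if m - l == n - k:       # the delta is nonzero only on this diagonal
                total += (1 - a) ** 2 * a ** k * a ** l * var_w
    return total


def k_yy_closed(m, n, a, var_w):
    """[(1-a)/(1+a)] a^|m-n| var_w (1 - a^(2 min(m,n) + 2)) for m, n >= 0."""
    return (1 - a) / (1 + a) * a ** abs(m - n) * var_w * (1 - a ** (2 * min(m, n) + 2))


a, var_w = 0.6, 2.0
worst = max(abs(k_yy_double_sum(m, n, a, var_w) - k_yy_closed(m, n, a, var_w))
            for m in range(8) for n in range(8))
steady = (1 - a) / (1 + a) * var_w * a ** 2   # steady-state value at lag |m - n| = 2
print(worst, k_yy_closed(50, 48, a, var_w), steady)
```

For large m and n the closed form approaches the steady-state expression, since the transient term decays geometrically.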


As an alternative to this method of solution, one can take the expectation of 
Equation 8.3-12 to directly obtain a recursive equation for the output mean sequence which 
can be solved by the methods of Section 8.2: 


$$\mu_Y[n] = a\,\mu_Y[n-1] + (1-a)\,\mu_W[n], \quad n \ge 0,$$

with an appropriate initial condition. For example, if $\mu_Y[-1] = 0$ and $\mu_W[n] = \mu_W$, a given
constant, then the solution is

$$\mu_Y[n] = (1 - a^{n+1})\,\mu_W\,u[n].$$
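
The mean recursion and its solution (1 − a^{n+1}) μ_W for n ≥ 0 can be compared numerically. This is a sketch under the stated initial condition μ_Y[−1] = 0 and constant input mean; the function name and the values a = 0.8, μ_W = 2 are hypothetical.

```python
def mean_recursion(a, mu_w, num_samples):
    """Iterate mu_Y[n] = a*mu_Y[n-1] + (1-a)*mu_W for n >= 0, starting from mu_Y[-1] = 0."""
    mu_prev, out = 0.0, []
    for _ in range(num_samples):
        mu_prev = a * mu_prev + (1 - a) * mu_w
        out.append(mu_prev)
    return out


a, mu_w = 0.8, 2.0
recursive = mean_recursion(a, mu_w, 30)
closed = [(1 - a ** (n + 1)) * mu_w for n in range(30)]
max_diff = max(abs(r - c) for r, c in zip(recursive, closed))
print(recursive[:3], max_diff)
```

The output mean rises geometrically toward μ_W, consistent with the unity dc gain of this filter.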
We can also use this method to calculate the cross-correlation function between input and 
output. First we conjugate Equation 8.3-12, then multiply by W[m], and finally take the 
expectation to yield, for a real, 


$$R_{WY}[m,n] = a\,R_{WY}[m,n-1] + (1-a)\,R_{WW}[m,n], \tag{8.3-13}$$

which can be solved directly for $R_{WY}$ in terms of $R_{WW}$. The partial difference equation for
the output correlation $R_{YY}$ is obtained by re-expressing Equation 8.3-12 as a function of
m, multiplying by Y*[n], and then taking the expectation to yield

$$R_{YY}[m,n] = a\,R_{YY}[m-1,n] + (1-a)\,R_{WY}[m,n]. \tag{8.3-14}$$


These two difference equations can be solved by the methods of Section 8.2 since they can 
each be seen to be one-dimensional difference equations with constant coefficients in one 
index, with the other index simply playing the role of an additional parameter. Thus, for 
example, one must solve Equation 8.3-13 as a function of n for each value of m in succession. 


8.4 WSS RANDOM SEQUENCES 


In this section we will assume that the random sequences of interest are all WSS, that is,
(1) $E\{X[n]\} = \mu_X$, a constant,
(2) $R_{XX}[k+m, k] = E\{X[k+m]\,X^*[k]\} = R_{XX}[m]$,

and of second order, that is, $E\{|X[n]|^2\} < \infty$.
Some important properties of the autocorrelation function of stationary random
sequences are presented below. They also hold for covariance functions, since they are just
the autocorrelation function of the centered random sequence $X_c[n] \triangleq X[n] - \mu_X$.


1. For arbitrary m, $|R_{XX}[m]| \le R_{XX}[0] \ge 0$, which follows directly from
$E\{|X[m] - X[0]|^2\} \ge 0$ for X[n] real valued, otherwise use Schwarz inequality (cf.
Equation 4.3-15).

2. $|R_{XY}[m]| \le \sqrt{R_{XX}[0]\,R_{YY}[0]}$, which is derived using the Schwarz inequality.

3. $R_{XX}[m] = R_{XX}^*[-m]$ since $R_{XX}[m] = E\{X[m+l]\,X^*[l]\} = E\{X[l]\,X^*[l-m]\} =
E^*\{X[l-m]\,X^*[l]\} = R_{XX}^*[-m]$.

4. For all N > 0 and all complex $a_1, a_2, \ldots, a_N$, we must have

$$\sum_{n=1}^{N}\sum_{k=1}^{N} a_n a_k^*\,R_{XX}[n-k] \ge 0.$$


Property 4 is the positive semidefinite property of autocorrelation functions. It is a 
necessary and sufficient property for a function to be a valid autocorrelation function of a 
random sequence. In general it is very difficult to directly apply property 4 to test a function 
to see if it qualifies as a valid autocorrelation function. However, we soon will introduce an 
equivalent frequency domain function called power spectral density, which furnishes an easy 
test of validity. 

Many of the input-output relations derived in the previous section take a surpris- 
ingly simple form in the case of WSS random sequences and LSI systems described via 
convolution. For example, starting with

$$Y[n] = \sum_{k=-\infty}^{+\infty} h[n-k]\,X[k],$$

we obtain

$$\begin{aligned}
R_{XY}[m,n] &= E\{X[m]\,Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h^*[n-k]\,R_{XX}[m-k] \\
&= \sum_{l=-\infty}^{+\infty} h^*[-l]\,R_{XX}[(m-n)-l], \quad \text{with } l \triangleq k - n,
\end{aligned}$$

if the input random sequence X[n] is WSS. So, the output cross-correlation function
$R_{XY}[m,n]$ is shift-invariant, and we can make use of the one-parameter cross-correlation
function $R_{XY}[m] \triangleq R_{XY}[m, 0]$ to write

$$R_{XY}[m] = \sum_{l=-\infty}^{+\infty} h^*[-l]\,R_{XX}[m-l] = h^*[-m] * R_{XX}[m],$$

in terms of the one-parameter autocorrelation function $R_{XX}[m]$. Likewise, recalling that


$$\begin{aligned}
R_{YY}[n+m, n] &\triangleq E\{Y[n+m]\,Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h[k]\,E\{X[n+m-k]\,Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h[k]\,R_{XY}[m-k] \\
&= h[m] * R_{XY}[m],
\end{aligned}$$

we see that the autocorrelation function of the output is shift-invariant, and so making use
of the one-parameter autocorrelation function $R_{YY}[m] \triangleq R_{YY}[m, 0]$, we have

$$R_{YY}[m] = h[m] * R_{XY}[m].$$

Combining both equations, we get

$$\begin{aligned}
R_{YY}[m] &= h[m] * h^*[-m] * R_{XX}[m] \\
&= (h[m] * h^*[-m]) * R_{XX}[m] \qquad (8.4\text{-}1) \\
&= g[m] * R_{XX}[m], \quad \text{with } g[m] \triangleq h[m] * h^*[-m],
\end{aligned}$$

where g[m] is sometimes called the autocorrelation impulse response (AIR). Note that if
the input random sequence is WSS, zero-mean, and independent, then its autocorrelation function
would be a positive constant times δ[m], so that taking this constant to be unity, we would
have the output autocorrelation function equal to g[m] itself. Therefore, g[m] must possess
all the properties of autocorrelation functions, that is, g[l] = g*[−l], g[0] ≥ |g[l]| for all l,
and positive semidefiniteness. The AIR g depends only on the impulse response h of the
LSI system; however, in the absence of other information, we cannot uniquely determine h
from g. In astronomy, crystallography, and other fields the problem of estimating h from
the AIR is an important problem known by various names including phase recovery and
deconvolution.


Example 8.4-1 
(impulse response) We cannot in general calculate the impulse response from the AIR. 
To show this, first take the Fourier transform of g[m] to obtain G(ω) = H(ω)H*(ω) =
|H(ω)|². Then note that |H(ω)| = √G(ω). Thus the phase of H(ω) is lost in the AIR, but
the magnitude of H(ω) is preserved. Often there is some information available that can
narrow down or possibly pinpoint the phase, for example, the support of h[n] in an image 
application, or causality for a time-based signal. For the interested reader, the literature 
contains many articles on this subject; see for example [8-6]. 
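
A small numerical illustration of the lost phase: two different real impulse responses can share the same AIR. The sketch below uses a hypothetical two-tap response and its time reversal; the function name `air` and the tap values are illustrative assumptions, not from the text.

```python
def air(h):
    """AIR of a real finite h: g[m] = sum_k h[k+m] h[k], listed for m = -(L-1), ..., L-1."""
    length = len(h)
    return [sum(h[k + m] * h[k] for k in range(length) if 0 <= k + m < length)
            for m in range(-(length - 1), length)]


h1 = [1.0, 0.5]
h2 = [0.5, 1.0]   # time-reversed h1: same |H(w)|, different phase

g1, g2 = air(h1), air(h2)
print(g1, g2)
```

Both responses give the same g[m], so g alone cannot distinguish them; extra information such as causality or known support is needed, as noted above.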


Example 8.4-2 
(correlation function analysis of the edge detector using impulse response) In the edge 
detector of Example 8.3-2, the linear transformation was given as 


$$Y[n] = L\{X[n]\} \triangleq X[n] - X[n-1],$$

an LSI operation with impulse response h[n] = δ[n] − δ[n − 1], and input autocorrelation
function $R_{XX}[m] = a^{|m|}$, with |a| < 1. We can easily calculate the AIR as

$$g[m] = h[m] * h[-m] = 2\delta[m] - \delta[m-1] - \delta[m+1].$$

We then calculate the output autocorrelation function in this WSS case as

$$\begin{aligned}
R_{YY}[m] &= g[m] * R_{XX}[m] \\
&= (2\delta[m] - \delta[m-1] - \delta[m+1]) * a^{|m|} \\
&= 2a^{|m|} - a^{|m-1|} - a^{|m+1|} \quad \text{for } -\infty < m < +\infty,
\end{aligned}$$


which agrees with the answer in Example 8.3-2, where the result was plotted for a = 0.7.

Power Spectral Density


We define power spectral density (psd) as the FT (cf. Definition 8.2-3) of the one-parameter
discrete-time autocorrelation function of a WSS random sequence X[n]:

$$S_{XX}(\omega) \triangleq \sum_{m=-\infty}^{+\infty} R_{XX}[m]\exp(-j\omega m), \quad \text{for } -\pi \le \omega \le +\pi. \tag{8.4-2}$$

Now by taking the FT of Equation 8.4-1, we get the following important psd input/output
relation for an LSI system excited by a WSS random sequence:

$$S_{YY}(\omega) = |H(\omega)|^2\,S_{XX}(\omega) = G(\omega)\,S_{XX}(\omega), \tag{8.4-3}$$

where the various frequency-domain quantities are discrete-time Fourier transforms.
Equation 8.4-3 is a central result in the theory of WSS random sequences in that it enables
the computation of the output psd directly from knowledge of the input psd and the transfer
function magnitude. By using the IFT, we can calculate the autocorrelation function as

$$R_{XX}[m] = IFT\{S_{XX}(\omega)\} = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S_{XX}(\omega)\,e^{+j\omega m}\,d\omega,$$

so that knowledge of the psd implies knowledge of the autocorrelation function.

As to the name power spectral density, note that $R_{XX}[0] = E\{|X[n]|^2\}$ is the ensemble
average power in X[n] and so by the above relation, we see that

$$E\{|X[n]|^2\} = R_{XX}[0] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S_{XX}(\omega)\,d\omega,$$

so that the integral average of the psd over its frequency range [−π, +π] is indeed average
power. To pursue this further, we consider a WSS random sequence X[n] input to an LSI
system consisting of a narrow band filter H(ω), with very small bandwidth 2ε, centered at
frequency $\omega_o$, where $|\omega_o| < \pi$, and with unity passband gain. Writing $S_{XX}(\omega)$ for the input
psd, we have for the output ensemble average power, approximately

$$R_{YY}[0] = \frac{1}{2\pi}\int_{\omega_o-\epsilon}^{\omega_o+\epsilon} S_{XX}(\omega)\,d\omega \simeq S_{XX}(\omega_o)\,\frac{\epsilon}{\pi},$$

thus showing that $S_{XX}(\omega)$ can be interpreted as a density function in frequency for ensemble
average power.
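
As an illustrative check of the psd definition and the average-power relation, the sketch below uses the autocorrelation a^{|m|} from the edge-detector examples. The closed-form psd (1 − a²)/(1 − 2a cos ω + a²) comes from summing the two geometric series in Equation 8.4-2; the function names, the truncation length, and the midpoint-rule integration are assumptions made for this sketch.

```python
import math


def psd_closed(w, a):
    """S_XX(w) for R_XX[m] = a^|m|, 0 < a < 1, from the geometric-series sum."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)


def psd_truncated(w, a, max_lag):
    """Equation 8.4-2 truncated to |m| <= max_lag (R_XX is real and even, so use cosines)."""
    return 1.0 + 2.0 * sum(a ** m * math.cos(w * m) for m in range(1, max_lag + 1))


def average_power(a, num_points=2000):
    """(1/2pi) * integral of S_XX over [-pi, pi] by the midpoint rule."""
    dw = 2 * math.pi / num_points
    total = sum(psd_closed(-math.pi + (i + 0.5) * dw, a) for i in range(num_points))
    return total * dw / (2 * math.pi)


a = 0.7
print(psd_truncated(1.0, a, 200), psd_closed(1.0, a), average_power(a))
```

The integral average should come out very close to R_XX[0] = 1, and the psd values are positive at every frequency, in line with the properties listed next.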
Some important properties of the psd are given below: 


1. The function $S_{XX}(\omega)$ is real valued since $R_{XX}[m]$ is conjugate-symmetric.

2. If X[n] is a real-valued random sequence, then $S_{XX}(\omega)$ is an even function of ω.

3. The function $S_{XX}(\omega) \ge 0$ for every ω, whether X[n] is real- or complex-valued.

4. If $R_{XX}[m] = 0$ for all |m| > N for some finite integer N > 0 (i.e., it has finite
support), then $S_{XX}(\omega)$ is an analytic function in ω. This means that $S_{XX}(\omega)$ can
be represented in a Taylor series given its value and that of all its derivatives at a
single point $\omega_o$.

Since $S_{XX}(\omega)$ is the Fourier transform of a (autocorrelation) sequence, it is periodic with
period 2π. This is why the inverse Fourier transform, which recovers the autocorrelation
function, only integrates over [−π, +π], the primary period. We also define the Fourier
transform of the cross-correlation function of two jointly WSS random sequences:

$$S_{XY}(\omega) \triangleq \sum_{m=-\infty}^{+\infty} R_{XY}[m]\exp(-j\omega m), \quad \text{for } -\pi \le \omega \le +\pi,$$


called the cross-power spectral density between random sequences X and Y. In general, this 
cross-power spectral density can be complex, negative, and lacking in symmetry. Its main 
use is as an intermediate step in calculation of psd’s. 


Interpretation of the psd 


From its name, we expect that the psd should be related to some kind of average of the 
magnitude-square of the Fourier transform of the random signal. Now since a WSS random 
signal X[n] has constant average power $R_{XX}[0]$ for all time, we cannot define its FT;
however, we can define the transform quantity

$$X_N(\omega) \triangleq FT\{w_N[n]\,X[n]\}$$

with aid of the rectangular window function

$$w_N[n] \triangleq \begin{cases} 1, & |n| \le N, \\ 0, & \text{else}. \end{cases}$$

Then, taking the expectation of the magnitude square $|X_N(\omega)|^2$, and dividing by 2N + 1,
we get

$$\begin{aligned}
\frac{1}{2N+1}E\{|X_N(\omega)|^2\} &= \frac{1}{2N+1}E\left\{\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} X[k]\,X^*[l]\exp(-j\omega k)\exp(+j\omega l)\right\} \\
&= \frac{1}{2N+1}\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} E\{X[k]\,X^*[l]\}\exp(-j\omega k)\exp(+j\omega l) \\
&= \frac{1}{2N+1}\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} R_{XX}[k-l]\exp[-j\omega(k-l)] \\
&= \sum_{m=-2N}^{+2N} R_{XX}[m]\left(1 - \frac{|m|}{2N+1}\right)\exp(-j\omega m),
\end{aligned}$$

where the last line comes from the fact that $R_{XX}[k-l]$ is constant along diagonals k − l = m
of the (2N + 1) × (2N + 1) point square in the (k, l) plane.
Now as N → ∞, the triangular function $\left(1 - \frac{|m|}{2N+1}\right)$ has less and less effect if
$|R_{XX}[m]| \to 0$ as |m| → ∞, as it must for the Fourier transform, that is, $S_{XX}(\omega)$, to exist. In fact,
if we assume that $|R_{XX}[m]|$ decays fast enough to satisfy $\sum_{m=-\infty}^{+\infty} |m|\,|R_{XX}[m]| < \infty$, then
we have

$$S_{XX}(\omega) = \lim_{N\to\infty}\frac{1}{2N+1}E\{|X_N(\omega)|^2\}. \tag{8.4-4}$$


In words we have that the ensemble average of the power at frequency ω in the windowed
random sequence is given by the psd $S_{XX}(\omega)$. Note that we have said nothing about the
variance of the random variable $\frac{1}{2N+1}|X_N(\omega)|^2$, but just that its mean value converges
to the psd. In the study of spectral estimation (cf. Section 11.6), it is shown that the
variance does not get small as N gets large, so that $\frac{1}{2N+1}|X_N(\omega)|^2$ cannot be considered a
good estimate of the psd without first doing some averaging. In the language of statistics
(Chapter 6) we say that $(2N+1)^{-1}|X_N(\omega)|^2$ is not a consistent estimator for $S_{XX}(\omega)$.
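
The convergence of the mean in Equation 8.4-4 can be illustrated deterministically, without simulating any random data, by evaluating the triangularly weighted sum for a known autocorrelation. The sketch assumes the illustrative choice R_XX[m] = a^{|m|} with a = 0.7, whose psd is (1 − a²)/(1 − 2a cos ω + a²); the function names are hypothetical.

```python
import math


def psd_closed(w, a):
    """S_XX(w) for R_XX[m] = a^|m| (assumed example autocorrelation)."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)


def expected_periodogram(w, a, big_n):
    """sum_{m=-2N}^{2N} R_XX[m] (1 - |m|/(2N+1)) exp(-j w m), with R_XX[m] = a^|m|."""
    total = 1.0  # m = 0 term
    for m in range(1, 2 * big_n + 1):
        # Terms at +m and -m combine into a cosine because R_XX is real and even.
        total += 2.0 * a ** m * (1 - m / (2 * big_n + 1)) * math.cos(w * m)
    return total


a, w = 0.7, 1.0
errors = {n: abs(expected_periodogram(w, a, n) - psd_closed(w, a)) for n in (10, 100, 1000)}
print(errors)
```

The bias shrinks roughly like 1/(2N + 1). This concerns only the mean of the windowed estimate; as the text notes, its variance does not shrink with N.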


Example 8.4-3 
Here is a MATLAB m-file to compute the psd’s of the random sequences with memory in 
Example 8.1-16 for p = 0.8, 0.5, and 0.2. 


function [psd1, psd2, psd3] = psdmarkov2(N, p1, p2, p3)
% Compute and plot the psd's of three random sequences with memory.
mc1 = zeros(1, N);
mc2 = zeros(1, N);
mc3 = zeros(1, N);
for i = 1:N
    % The (-1)^(i-1) factor shifts the spectrum to yield an even function
    % of frequency. Otherwise the highest frequency components appear
    % at pi and the lowest at 2*pi.
    mc1(i) = 0.25*(((-1)*(2*p1-1))^(i-1));
    mc2(i) = 0.25*(((-1)*(2*p2-1))^(i-1));
    mc3(i) = 0.25*(((-1)*(2*p3-1))^(i-1));
end
x = linspace(-pi, pi, N);
psd1 = abs(fft(mc1));
psd2 = abs(fft(mc2));
psd3 = abs(fft(mc3));
plot(x, psd1, x, psd2, x, psd3)
title('Power spectral density (psd) of random sequences with memory')
xlabel('radian frequency')
ylabel('psd value')
end


See the three plots in Figure 8.4-1. 


Figure 8.4-1 Power spectral densities of three stationary random sequences with memory.


Example 8.4-4
A stationary random sequence X[n] has power spectral density $S_{XX}(\omega) = N_0\,w(3\omega/4\pi)$,
where the rectangular window function w is given as

$$w(x) \triangleq \begin{cases} 1, & |x| \le 1/2, \\ 0, & \text{else}. \end{cases}$$

It is desired to produce an output random sequence Y[n] with the psd $S_{YY}(\omega) = N_0\,w(\omega/\pi)$.
An LSI system (not necessarily causal) with impulse response h[n] is proposed. Which of
the following impulse responses should be used? (Note that $\mathrm{sinc}(x) \triangleq \sin(\pi x)/\pi x$.)

(a) $2\,\mathrm{sinc}(n/2)$,

(b) $\frac{1}{2}\,\mathrm{sinc}((n - 10)/2)$,


(c) 1.5e7u[n], 

(d) u[n + 2] − u[n − 2],

(e) (1 − |n|) w(n/2).

Solution Clearly what is needed is an H(ω) with transfer-function magnitude |H(ω)| =
w(ω/π). Choices (c) through (e) are ruled out immediately because their Fourier transforms
do not have constant magnitude inside any frequency band. Since the IFT of w(ω/π) is
$\frac{1}{2}\,\mathrm{sinc}(n/2)$, we choose (b) since its 10-sample delay does not affect the magnitude |H(ω)|. ∎
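
The selection can be sanity-checked numerically. The sketch below assumes that choice (b) reads ½ sinc((n − 10)/2), that is, the stated inverse transform ½ sinc(n/2) delayed by 10 samples, and it truncates the infinitely long response symmetrically about n = 10. The truncation length of 2000 samples on each side and the helper names are assumptions for this sketch, so the magnitudes are only approximate.

```python
import cmath
import math


def sinc(x):
    """sinc(x) = sin(pi x)/(pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def h_b(n):
    """Assumed reading of choice (b): 0.5*sinc((n - 10)/2)."""
    return 0.5 * sinc((n - 10) / 2)


def dtft_magnitude(h_of_n, w, n_min, n_max):
    """|H(w)| from a truncated sum over n_min <= n <= n_max."""
    return abs(sum(h_of_n(n) * cmath.exp(-1j * w * n) for n in range(n_min, n_max + 1)))


mag_passband = dtft_magnitude(h_b, 0.0, 10 - 2000, 10 + 2000)      # inside |w| <= pi/2
mag_stopband = dtft_magnitude(h_b, math.pi, 10 - 2000, 10 + 2000)  # outside the band
print(mag_passband, mag_stopband)
```

The magnitude is close to 1 in the passband and close to 0 in the stopband; the residual comes from truncating the slowly decaying sinc, not from the 10-sample delay.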


A useful summary of input/output relations for random sequences is presented in
Table 8.4-1.

Table 8.4-1 Input/Output Relations for WSS Sequences and Linear Systems

Random Sequence:                                  Output Mean:
$Y[n] = h[n] * X[n]$                              $\mu_Y = H(0)\,\mu_X$

Crosscorrelations:                                Cross-Power Spectral Densities:
$R_{XY}[m] = R_{XX}[m] * h^*[-m]$                 $S_{XY}(\omega) = S_{XX}(\omega)\,H^*(\omega)$
$R_{YX}[m] = h[m] * R_{XX}[m]$                    $S_{YX}(\omega) = H(\omega)\,S_{XX}(\omega)$
$R_{YY}[m] = R_{YX}[m] * h^*[-m]$                 $S_{YY}(\omega) = S_{YX}(\omega)\,H^*(\omega)$

Autocorrelation:                                  Power Spectral Density:
$R_{YY}[m] = h[m] * h^*[-m] * R_{XX}[m]$          $S_{YY}(\omega) = |H(\omega)|^2\,S_{XX}(\omega)$
$\qquad\;\; = g[m] * R_{XX}[m]$                   $\qquad\;\; = G(\omega)\,S_{XX}(\omega)$

Output Power and Variance:
$E\{|Y[n]|^2\} = R_{YY}[0] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} |H(\omega)|^2\,S_{XX}(\omega)\,d\omega$
$\sigma_Y^2 = R_{YY}[0] - |\mu_Y|^2$


Synthesis of Random Sequences and Discrete-Time Simulation 


Here we consider the problem of finding the appropriate transfer function H(w) to generate 
a random sequence with a specified psd or correlation function. Consider Equation 8.2-2, 
repeated here for convenience: 


$$y[n] = \sum_{k=1}^{M} a_k\, y[n-k] + \sum_{k=0}^{N} b_k\, x[n-k], \tag{8.4-5}$$

where the coefficients are real valued. The transfer function $H(\omega)$ is given by

$$H(\omega) = \frac{Y(\omega)}{X(\omega)} = \frac{B(\omega)}{A(\omega)},$$

where $B(\omega) \triangleq \sum_{k=0}^{N} b_k e^{-j\omega k}$ and $A(\omega) \triangleq 1 - \sum_{k=1}^{M} a_k e^{-j\omega k}$. When driven by a white-noise
sequence $W[n]$ with power $E\{|W[n]|^2\} = \sigma_W^2$, the output psd $S_{YY}(\omega)$ is given by

$$S_{YY}(\omega) = |H(\omega)|^2\,\sigma_W^2 = \frac{B(\omega)B^*(\omega)}{A(\omega)A^*(\omega)}\,\sigma_W^2. \tag{8.4-6}$$

Now, recalling that $B(z) = B(\omega)$ at $z = e^{j\omega}$ and similarly for $A(z)$ and $H(z)$, that $e^{-j\omega} = z^{-1}$,
and that $B^*(e^{j\omega}) = B(e^{-j\omega})$ for an LCCDE with real coefficients,† we obtain

†Only when, as here, the impulse response coefficients are real valued. This is true here since the
numerator and denominator coefficients are real numbers.

$$S_{YY}(z) = \frac{B(z)B(z^{-1})}{A(z)A(z^{-1})}\,\sigma_W^2 = H(z)H(z^{-1})\,\sigma_W^2, \tag{8.4-7}$$


where up to this point we have confined $z$ to the unit circle. For the purpose of further
analysis, it is of interest to extend Equation 8.4-7 to the whole $z$-plane.

This last step is called analytic continuation and simply amounts to finding a rational
function of $z$ which agrees with the given psd information on the unit circle $z = e^{j\omega}$.

Given any rational $S_{XX}(z)$, that is, one with a finite number of poles and zeros in the
finite $z$-plane, one can find such a spectral factorization as Equation 8.4-7 by defining $H(z)$
to have all the poles and zeros that are inside the unit circle, $\{|z| < 1\}$, and then $H(z^{-1})$
will necessarily have all the poles and zeros outside the unit circle, $\{|z| > 1\}$.


Example 8.4-5 
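Consider the psd

$$S_{XX}(\omega) = \frac{\sigma_W^2}{1 - 2\rho\cos\omega + \rho^2} \quad\text{with } |\rho| < 1.$$

We want to first extend $S_{XX}(\omega)$ to all of the $z$-plane. Now $\cos\omega = \tfrac{1}{2}(e^{j\omega} + e^{-j\omega})$, which
can be extended as $\tfrac{1}{2}(z + z^{-1})$ and satisfies the symmetry condition $S_{XX}(z) = S_{XX}(z^{-1})$
of a real-valued random sequence. Then

$$\begin{aligned}
S_{XX}(z) &= \frac{\sigma_W^2}{1 - \rho(z + z^{-1}) + \rho^2} \\
&= \frac{\sigma_W^2}{(1 - \rho z)(1 - \rho z^{-1})} \\
&= \sigma_W^2\, H(z)H(z^{-1}) \quad\text{for } |\rho| < |z| < \frac{1}{|\rho|},
\end{aligned}$$

with $H(z) = \dfrac{1}{1 - \rho z^{-1}}$ for region of convergence $|\rho| < |z|$.

Since $|\rho| < 1$, the region of convergence (ROC) includes the unit circle and so $H$ is
both stable and causal. Indeed the system with $h[n] = \rho^n u[n]$ will yield $S_{XX}(\omega)$ from an
independent sequence.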
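
The factorization in this example can be checked numerically. The following Python sketch (standard library only; the function names and the trial value $\rho = 0.6$ are illustrative choices, not taken from the text) evaluates the psd three ways: from the closed form, from $H(z)H(z^{-1})$ on the unit circle, and from a truncated sum over $h[n] = \rho^n u[n]$.

```python
import cmath
import math


def psd_closed_form(omega, rho, sigma_w2=1.0):
    """S_XX(omega) = sigma_W^2 / (1 - 2 rho cos(omega) + rho^2)."""
    return sigma_w2 / (1.0 - 2.0 * rho * math.cos(omega) + rho ** 2)


def psd_from_factors(omega, rho, sigma_w2=1.0):
    """sigma_W^2 H(z) H(1/z) evaluated at z = exp(j omega)."""
    z = cmath.exp(1j * omega)
    h_z = 1.0 / (1.0 - rho / z)      # H(z) = 1 / (1 - rho z^{-1})
    h_zinv = 1.0 / (1.0 - rho * z)   # H(z^{-1})
    return (sigma_w2 * h_z * h_zinv).real


def psd_from_impulse_response(omega, rho, sigma_w2=1.0, terms=200):
    """sigma_W^2 |sum_n h[n] exp(-j omega n)|^2 with h[n] = rho^n u[n], truncated."""
    h_omega = sum(rho ** n * cmath.exp(-1j * omega * n) for n in range(terms))
    return sigma_w2 * abs(h_omega) ** 2


if __name__ == "__main__":
    for omega in (0.0, 1.0, math.pi):
        print(omega, psd_closed_form(omega, 0.6),
              psd_from_factors(omega, 0.6),
              psd_from_impulse_response(omega, 0.6))
```

The three columns printed for each frequency should agree to within rounding, since the truncation error of the impulse-response sum is of order $\rho^{200}$.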


If a zero occurs on the unit circle, then it must be of even order, since otherwise one
can easily show that $S_{XX}(e^{j\omega})$ must go through zero and hence be negative in its vicinity.
Thus, we can assign half the zeros to $H(z)$ and the other half to $H(z^{-1})$. Since $H(z)$ contains
only poles inside the unit circle, it will be BIBO stable [8-5]. Except in the case of a zero
on the unit circle, its inverse will also be stable. The other factor $H(z^{-1})$ has all its poles
outside the unit circle, so it is stable in the anticausal sense. Denoting the largest pole
magnitude inside the unit circle by $\rho_{\max}$, we thus have that $S_{XX}(z)$ is analytic, that is, free
of singularities, in the annular region of convergence $\{\rho_{\max} < |z| < 1/\rho_{\max}\}$.

Following the above procedures, we can obtain the system function $H(z)$ that, when
driven by a white noise $W[n]$, will generate a random sequence $X[n]$ with specified psd
$S_{XX}(\omega)$. This can be the basis for a discrete-time simulation on a computer. The white
random sequence $W[n]$ is easily obtained by using the computer's random number generator.
Then one specifies appropriate initial conditions and proceeds to recursively calculate $X[n]$
using the LCCDE of the system function $H(z)$.

To achieve a Gaussian distribution for $X$, one could transform the output of the random
number generator to achieve a Gaussian distribution for $W$, which would carry across to $X$.
An approximate method that is often used is to average six to ten calls to the random
number generator to obtain an approximate Gaussian distribution for $W$ via the Central
Limit theorem. When simulating a non-Gaussian random variable, the distributions of $X$
and $W$ are not the same, so the preceding method will not work. One possibility is to use
the LCCDE to generate samples of $W[n]$ from some real data and then use the resulting
distribution for $W[n]$ in the simulation.
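
A minimal sketch of the averaging method, assuming the generator returns values uniform on $[0, 1)$: the sum of `calls` uniform values has mean `calls/2` and variance `calls/12`, so subtracting the mean and dividing by the standard deviation gives an approximately standard Gaussian sample. The function name `approx_gaussian` and the choice of eight calls are illustrative, not from the text.

```python
import random
import statistics


def approx_gaussian(rng, calls=8):
    """Approximate N(0, 1) sample built from `calls` uniform(0, 1) draws."""
    total = sum(rng.random() for _ in range(calls))
    mean = calls / 2.0                # mean of the sum
    std = (calls / 12.0) ** 0.5       # standard deviation of the sum
    return (total - mean) / std


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [approx_gaussian(rng) for _ in range(50000)]
    print("sample mean:", statistics.fmean(samples))
    print("sample variance:", statistics.pvariance(samples))
```

Note that the result is bounded (here $|W| \le \sqrt{24}$), so the tails are only approximately Gaussian; an exact transformation is preferable when tail behavior matters.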


Example 8.4-6 
(matching given correlation values) In order to simulate a zero-mean random sequence with
average power $R_{XX}[0] = \sigma^2$ and nearest neighbor correlation $R_{XX}[1] = \rho\sigma^2$, we want to
find the parameters of a first-order stochastic difference equation to achieve these values.
Thus consider

$$X[n] = aX[n-1] + bW[n], \tag{8.4-8}$$

where $W[n]$ is a zero-mean white-noise source with unit power. Computing the impulse
response, we get

$$h[n] = b\,a^n u[n]$$

and the corresponding system function

$$H(z) = \frac{b}{1 - az^{-1}}.$$

Since the mean is zero, we calculate the covariance of the output $X[n]$ of Equation 8.4-8:

$$\begin{aligned}
K_{XX}[m] &= h[m] * h[-m] \\
&= b^2\,(a^m u[m]) * (a^{-m} u[-m]) \\
&= b^2 \sum_{k=-\infty}^{+\infty} a^k u[k]\, a^{m+k} u[m+k] \\
&= b^2 a^m \sum_{k=\max(0,-m)}^{+\infty} a^{2k} \\
&= \frac{b^2 a^{|m|}}{1 - a^2}, \quad -\infty < m < +\infty.
\end{aligned}$$

From the specifications at $m = 0$ and $m = 1$, we need

$$K_{XX}[0] = \sigma^2 = b^2/(1 - a^2),$$

$$K_{XX}[1] = \rho\sigma^2 = ab^2/(1 - a^2).$$

Thus,

$$a = \rho \quad\text{and}\quad b^2 = \sigma^2(1 - \rho^2).$$

To compute the resulting psd, we use Equation 8.4-4 to get

$$S_{XX}(\omega) = \frac{b^2}{|1 - \rho e^{-j\omega}|^2} = \frac{\sigma^2(1 - \rho^2)}{1 - 2\rho\cos\omega + \rho^2}.$$
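
The design in this example, $a = \rho$ and $b^2 = \sigma^2(1-\rho^2)$, can be tried out in a short simulation. The sketch below is one possible way to do so in Python, assuming a Gaussian white-noise input from the standard `random` module; the helper names, the burn-in length, and the trial values $\sigma^2 = 4$, $\rho = 0.5$ are illustrative assumptions.

```python
import math
import random


def ar1_parameters(sigma2, rho):
    """Return (a, b) so that K_XX[0] = sigma2 and K_XX[1] = rho * sigma2."""
    return rho, math.sqrt(sigma2 * (1.0 - rho ** 2))


def ar1_covariance(a, b, m):
    """Theoretical covariance b^2 a^|m| / (1 - a^2)."""
    return b * b * a ** abs(m) / (1.0 - a * a)


def simulate_ar1(a, b, n, rng, burn_in=500):
    """Run X[n] = a X[n-1] + b W[n] with unit-power Gaussian W[n]."""
    x = 0.0
    out = []
    for i in range(n + burn_in):
        x = a * x + b * rng.gauss(0.0, 1.0)
        if i >= burn_in:          # discard the startup transient
            out.append(x)
    return out


def sample_correlation(x, m):
    """Time-average estimate of R_XX[m] for a zero-mean sequence."""
    n = len(x) - m
    return sum(x[i + m] * x[i] for i in range(n)) / n


if __name__ == "__main__":
    a, b = ar1_parameters(4.0, 0.5)
    x = simulate_ar1(a, b, 100000, random.Random(1))
    print("R[0] estimate:", sample_correlation(x, 0), "target:", ar1_covariance(a, b, 0))
    print("R[1] estimate:", sample_correlation(x, 1), "target:", ar1_covariance(a, b, 1))
```

The estimates fluctuate around the targets $\sigma^2$ and $\rho\sigma^2$; how closely they match depends on the run length.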


Example 8.4-7 
(decimation and interpolation) Let $X[n]$ be a WSS random sequence. We consider what
happens to its stationarity and psd when we subject it to decimation or interpolation as
occur in many signal processing systems.

Decimation

Set $Y[n] \triangleq X[2n]$, called decimation by the factor 2, thus throwing away every odd indexed
sample of $X[n]$ (Figure 8.4-2). We easily calculate the mean function as $\mu_Y[n] \triangleq E\{Y[n]\} = E\{X[2n]\} = \mu_X[2n] = \mu_X$, a constant. For the correlation,

$$\begin{aligned}
R_{YY}[n+m, n] &= E\{X[2n+2m]\,X^*[2n]\} \\
&= R_{XX}[2n+2m, 2n] \\
&= R_{XX}[2m],
\end{aligned}$$

thus showing that the WSS property of the original random sequence $X[n]$ is preserved in
the decimated random sequence. The psd of $Y[n]$ can be computed as

$$\begin{aligned}
S_{YY}(\omega) &= \sum_{m=-\infty}^{+\infty} R_{YY}[m]\exp[-j\omega m] \\
&= \sum_{m=-\infty}^{+\infty} R_{XX}[2m]\exp[-j\omega m] \\
&= \sum_{m\ \text{even}} R_{XX}[m]\exp\left[-j\frac{\omega}{2}m\right] = \sum_{m\ \text{even}} R_{XX}[m]\exp\left[-j\frac{\omega - 2\pi}{2}m\right](-1)^m.
\end{aligned}$$

Figure 8.4-2 In decimation every other value of $X[n]$ is discarded.

Figure 8.4-3 In interpolation, the expansion step inserts zeros between adjacent values of the $X[n]$
sequence, to get the expanded sequence $X_e[n]$.

Now, define $A_e \triangleq S_{YY}(\omega)$, and $A_o \triangleq \sum_{m\ \text{odd}} R_{XX}[m]\exp[-j\frac{\omega}{2}m]$. Then, clearly $A_e + A_o = S_{XX}(\frac{\omega}{2})$ and $A_e - A_o = S_{XX}(\frac{\omega - 2\pi}{2})$, so that

$$S_{YY}(\omega) = \frac{1}{2}\left[S_{XX}\left(\frac{\omega}{2}\right) + S_{XX}\left(\frac{\omega - 2\pi}{2}\right)\right],$$

which displays an aliasing [8-5] of higher-frequency terms.
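
The aliasing formula for decimation by 2 can be checked numerically for a specific correlation function. The Python sketch below assumes, purely for illustration, $R_{XX}[m] = \rho^{|m|}$, whose psd is $(1-\rho^2)/(1 - 2\rho\cos\omega + \rho^2)$; it compares the direct transform of $R_{XX}[2m]$ with the two-term aliased sum. The function names and truncation length are hypothetical.

```python
import cmath
import math


def psd_x(omega, rho):
    """psd of the illustrative correlation R_XX[m] = rho^|m|."""
    return (1.0 - rho ** 2) / (1.0 - 2.0 * rho * math.cos(omega) + rho ** 2)


def psd_decimated_direct(omega, rho, terms=200):
    """Truncated sum of R_XX[2m] exp(-j omega m) over m."""
    total = sum(rho ** abs(2 * m) * cmath.exp(-1j * omega * m)
                for m in range(-terms, terms + 1))
    return total.real


def psd_decimated_aliased(omega, rho):
    """(1/2) [S_XX(omega/2) + S_XX((omega - 2 pi)/2)]."""
    return 0.5 * (psd_x(omega / 2.0, rho)
                  + psd_x((omega - 2.0 * math.pi) / 2.0, rho))


if __name__ == "__main__":
    for omega in (0.0, 1.0, 2.5):
        print(omega, psd_decimated_direct(omega, 0.7),
              psd_decimated_aliased(omega, 0.7))
```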


Interpolation 


For interpolation by the factor 2, we do the opposite of decimation. First we perform an
expansion by setting

$$X_e[n] \triangleq \begin{cases} X[n/2], & n\ \text{even}, \\ 0, & n\ \text{odd}. \end{cases}$$

The resulting expanded random sequence is clearly nonstationary, because of the zero
insertions. See Figure 8.4-3. Formally the psd of $X_e[n]$ doesn't exist since the psd is defined
only for WSS sequences (Figure 8.4-4). We encounter such problems with a broad class
of random sequences and processes† classified as being cyclostationary (cf. Section 9.6), to
which $X_e[n]$ belongs. It is easy to convert such sequences to WSS by randomizing their
start times and then averaging over the start time (Example 9.6-1). However, here we
instead compute the power spectral density using Equation 8.4-4, which is permissible for
cyclostationary waveforms. Thus we write


$$E\{|X_N(\omega)|^2\} = E\left\{\left|\sum_{n=-N}^{N} X_e[n]\,e^{-j\omega n}\right|^2\right\}$$

and take the limit of $E\{|X_N(\omega)|^2\}/(2N+1)$ as $N \to \infty$. This quantity can be interpreted
as the psd, $S_{X_eX_e}(\omega)$, of the random sequence $X_e[n]$. If the algebra is carried out and we
assume that $R_{XX}[m]$ is absolutely summable, we find that $S_{X_eX_e}(\omega) = \tfrac{1}{2}S_{XX}(2\omega)$. For
further discussion of the expansion step, see Problems 8.58 and 8.59.

Figure 8.4-4 (a) The original psd of $X[n]$; (b) the psd of $X_e[n]$ (not drawn to scale). Note the "leakage"
of power density from the secondary periods into the primary period. An ideal lowpass filter with support
$[-\pi/2, \pi/2]$ will eliminate the contribution from the secondary periods.

†Random processes are continuous-time random waveforms to be discussed in Chapter 9.

Next we put $X_e[n]$, sometimes called an upsampled version of $X[n]$, through an ideal
lowpass filter with bandwidth $[-\frac{\pi}{2}, +\frac{\pi}{2}]$ and gain of 2, to produce the "ideal" interpolated
output $Y[n]$ as

$$Y[n] = h[n] * X_e[n].$$

The impulse response of such a filter is

$$h[n] = \frac{\sin(\pi n/2)}{\pi n/2}.$$

Thus,

$$Y[n] = \sum_{k=-\infty}^{+\infty} X_e[k]\,h[n-k] = \sum_{k=-\infty}^{+\infty} X[k]\,\frac{\sin\left((n-2k)\pi/2\right)}{(n-2k)\pi/2}.$$


First we calculate the mean function of $Y[n]$:

$$\begin{aligned}
\mu_Y[n] &\triangleq E\{Y[n]\} \\
&= E\left\{\sum_{k=-\infty}^{+\infty} X[k]\,\frac{\sin\left((n-2k)\pi/2\right)}{(n-2k)\pi/2}\right\} \\
&= \sum_{k=-\infty}^{+\infty} \mu_X\,\frac{\sin\left((n-2k)\pi/2\right)}{(n-2k)\pi/2} \\
&= \mu_X \sum_{k=-\infty}^{+\infty} \frac{\sin\left((n-2k)\pi/2\right)}{(n-2k)\pi/2},
\end{aligned}$$

the last step being allowed since $\mu_X$ is a constant. Now sampling theory can be used to show
that the infinite sum is 1, so that $\mu_Y[n] = \mu_X$. To see this we write the sampling theorem
representation for an arbitrary bandlimited function $g(t)$ and sampling period $T = 2$ as
in [8-5]:

$$g(t) = \sum_{k=-\infty}^{+\infty} g(2k)\,\frac{\sin\left((t-2k)\pi/2\right)}{(t-2k)\pi/2}. \tag{8.4-9}$$

Then we simply choose $t = n$ for the bandlimited function $g(t) = 1$ with zero bandwidth to
see that

$$1 = \sum_{k=-\infty}^{+\infty} \frac{\sin\left((n-2k)\pi/2\right)}{(n-2k)\pi/2}.$$

For future reference we define $h(t) \triangleq \dfrac{\sin(\pi t/2)}{\pi t/2}$. To find the correlation function, we
proceed to calculate


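$$\begin{aligned}
R_{YY}[n+m, n] &= E\{Y[n+m]\,Y^*[n]\} \\
&= \sum_{k_1, k_2 = -\infty}^{+\infty} E\{X[k_1]X^*[k_2]\}\,h[n+m-2k_1]\,h[n-2k_2] \\
&= \sum_{l_1\ \text{even}} R_{XX}[l_1] \sum_{l_2\ \text{even}} h[n+m-l_1-l_2]\,h[n+l_1-l_2] \\
&\quad + \sum_{l_1\ \text{odd}} R_{XX}[l_1] \sum_{l_2\ \text{odd}} h[n+m-l_1-l_2]\,h[n+l_1-l_2],
\end{aligned}$$

with $l_1 \triangleq k_1 - k_2$ and $l_2 \triangleq k_1 + k_2$, so that $l_2 \pm l_1$ is even. We can evaluate the sums

$$\sum_{l_2\ \text{even or odd}} h[n+m-l_1-l_2]\,h[n+l_1-l_2]$$

by letting $g(t) = h(t)$ in Equation 8.4-9 and allowing $t$ to take the value $t = m$. We find
that each sum, both the even and odd, equals $h[m - 2l_1]$. Thus,

$$R_{YY}[m+n, n] = R_{YY}[m] = \sum_{l_1} R_{XX}[l_1]\,h[m - 2l_1].$$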
ly 


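We thus see that $Y[n]$ is WSS and that $R_{YY}[m]$ interpolates $R_{XX}[m]$, that is,

$$R_{YY}[2m] = \sum_{l_1} R_{XX}[l_1]\,h[2m - 2l_1] = R_{XX}[m],$$

since $h[2m - 2l_1]$ is 1 for $l_1 = m$ and 0 otherwise. Calculating the psd, we obtain

$$\begin{aligned}
S_{YY}(\omega) &= \sum_{m} R_{YY}[m]\,e^{-j\omega m} \\
&= \sum_{m}\sum_{l_1} R_{XX}[l_1]\,h[m - 2l_1]\,e^{-j\omega m} \\
&= \sum_{l_1} R_{XX}[l_1] \sum_{m} h[m - 2l_1]\,e^{-j\omega m} \\
&= \sum_{l_1} R_{XX}[l_1]\,H(\omega)\,e^{-j2\omega l_1} \\
&= H(\omega)\,S_{XX}(2\omega) \\
&= \begin{cases} 2S_{XX}(2\omega), & |\omega| \le \pi/2, \\ 0, & \pi/2 < |\omega| \le \pi. \end{cases}
\end{aligned}$$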
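
Two facts used in the interpolation analysis lend themselves to a quick numerical check: the samples of the ideal interpolation filter $h[n] = \sin(\pi n/2)/(\pi n/2)$, shifted by multiples of 2, sum to 1, and the interpolated correlation $\sum_{l} R_{XX}[l]\,h[m - 2l]$ agrees with $R_{XX}$ at even lags. The Python sketch below assumes an illustrative correlation $R_{XX}[m] = \rho^{|m|}$ with $\rho = 0.5$ and uses truncated sums; all names are hypothetical.

```python
import math


def h(n):
    """Ideal interpolation filter sin(pi n / 2) / (pi n / 2), with h(0) = 1."""
    if n == 0:
        return 1.0
    x = math.pi * n / 2.0
    return math.sin(x) / x


def r_xx(m, rho=0.5):
    """Illustrative input correlation rho^|m| (an assumption, not from the text)."""
    return rho ** abs(m)


def interpolated_correlation(m, terms=2000):
    """Truncated sum over l of R_XX[l] h[m - 2 l]."""
    return sum(r_xx(l) * h(m - 2 * l) for l in range(-terms, terms + 1))


def shifted_filter_sum(n, terms=5000):
    """Truncated sum over k of h[n - 2 k]; the infinite sum equals 1."""
    return sum(h(n - 2 * k) for k in range(-terms, terms + 1))


if __name__ == "__main__":
    print("sum of shifted filter samples at n = 1:", shifted_filter_sum(1))
    for m in range(4):
        print(m, interpolated_correlation(2 * m), r_xx(m))
```

The filter sum converges slowly at odd $n$ (the terms decay only like $1/k$), so the truncated value is close to, but not exactly, 1.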


8.5 MARKOV RANDOM SEQUENCES 


We have already encountered some examples of Markov random sequences. Such sequences 
were loosely said to have a memory and to possess a state. Here we make these concepts 
more precise. We start with a definition. 


Definition 8.5-1 (Markov random sequence) 


(a) A continuous-valued Markov random sequence $X[n]$, defined for $n \ge 0$, satisfies the
conditional pdf expression

$$f_X(x_{n+k} \mid x_n, x_{n-1}, \ldots, x_0) = f_X(x_{n+k} \mid x_n)$$

for all $x_0, \ldots, x_n, x_{n+k}$, for all $n > 0$, and for all integers $k \ge 1$.

(b) A discrete-valued Markov random sequence $X[n]$, defined for $n \ge 0$, satisfies the
conditional PMF expression

$$P_X(x_{n+k} \mid x_n, \ldots, x_0) = P_X(x_{n+k} \mid x_n)$$

for all $x_0, \ldots, x_n, x_{n+k}$, for all $n > 0$, and for all $k \ge 1$.


It is sufficient for the above properties to hold for just k = 1, which is the so-called 
one-step case, as the general property can be built up from it. The discrete-valued Markov 
random sequence is also called a Markov chain and will be covered in the next section. Here 
we consider the continuous-valued case. 

To check the meaning and usefulness of the Markov concept, consider the general Nth-order
pdf $f_X(x_N, x_{N-1}, \ldots, x_0)$ of random sequence $X$, and repeatedly use conditioning to
obtain the chain rule of probability

$$f_X(x_0, x_1, \ldots, x_N) = f_X(x_0)\,f_X(x_1 \mid x_0)\,f_X(x_2 \mid x_1, x_0) \cdots f_X(x_N \mid x_{N-1}, \ldots, x_0). \tag{8.5-1}$$

Now substitute the basic one-step ($k = 1$) version of the Markov definition to obtain

$$\begin{aligned}
f_X(x_0, x_1, \ldots, x_N) &= f_X(x_0)\,f_X(x_1 \mid x_0)\,f_X(x_2 \mid x_1) \cdots f_X(x_N \mid x_{N-1}) \\
&= f_X(x_0) \prod_{k=1}^{N} f_X(x_k \mid x_{k-1}).
\end{aligned}$$


Next we present two examples of continuous-valued Markov random sequences which 
are Gaussian distributed. 


Example 8.5-1 
(Gauss Markov random sequence) Let $X[n]$ be a random sequence defined for $n \ge 1$, with
initial pdf

$$f_X(x; 0) = N(0, \sigma_0^2)$$

for a given $\sigma_0 > 0$ and transition pdf

$$f_X(x_n \mid x_{n-1}; n, n-1) \sim N(\rho x_{n-1}, \sigma_W^2)$$

with $|\rho| < 1$ and $\sigma_W > 0$. We want to determine the unconditional density of $X[n]$ at an
arbitrary time $n \ge 1$ and proceed as follows.

In general, one would have to advance recursively from the initial density by performing
the integrals (cf. Equation 2.6-84)

$$f_X(x; n) = \int_{-\infty}^{+\infty} f_X(x \mid \xi; n, n-1)\,f_X(\xi; n-1)\,d\xi \tag{8.5-2}$$

for $n = 1, 2, 3$, and so forth. However, in this example we know that the unconditional
first-order density will be Gaussian because each of the pdf's in Equation 8.5-2 is Gaussian,
and the Gaussian density "reproduces itself" in this context; that is, the product of two
exponential functions is still exponential. Hence the pdf $f_X(x; n)$ is determined by its first
two moments. We first calculate the mean function

$$\begin{aligned}
\mu_X[n] &= E\{X[n]\} \\
&= E[E\{X[n] \mid X[n-1]\}] \\
&= E[\rho X[n-1]] \\
&= \rho\,\mu_X[n-1],
\end{aligned}$$

where the outer expectation is over the values of $X[n-1]$. We thus obtain the recursive
equation

$$\mu_X[n] = \rho\,\mu_X[n-1], \quad n \ge 1,$$

with prescribed initial condition $\mu_X[0] = 0$. Hence $\mu_X[n] = 0$ for all $n$.

We also need the variance function $\sigma_X^2[n]$, which in this case is just $E[X^2[n]]$ since the
mean is zero. Calculating, we obtain

$$E\{X^2[n]\} = E[E\{X^2[n] \mid X[n-1]\}] = E[\sigma_W^2 + \rho^2 X^2[n-1]] = \sigma_W^2 + \rho^2 E\{X^2[n-1]\},$$

or

$$\sigma_X^2[n] = \rho^2\sigma_X^2[n-1] + \sigma_W^2, \quad n \ge 1.$$

This is a first-order difference equation, which can be solved for $\sigma_X^2[n]$ given the condition
$\sigma_X^2[0] = \sigma_0^2$ supplied by the initial pdf. The solution then is

$$\sigma_X^2[n] = \left[1 + \rho^2 + \rho^4 + \cdots + \rho^{2(n-1)}\right]\sigma_W^2 + \rho^{2n}\sigma_0^2 \to \frac{1}{1-\rho^2}\,\sigma_W^2 \quad\text{as } n \to \infty.$$
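
The variance recursion $\sigma_X^2[n] = \rho^2\sigma_X^2[n-1] + \sigma_W^2$ of this example and its closed-form solution can be compared directly. The following Python sketch does so for illustrative parameter values ($\rho = 0.8$, $\sigma_W^2 = 1$, $\sigma_0^2 = 5$, which are assumptions made here, not values from the text).

```python
def variance_recursion(rho, sigma_w2, sigma_02, n_max):
    """Iterate var[n] = rho^2 var[n-1] + sigma_W^2 starting from var[0] = sigma_0^2."""
    var = [sigma_02]
    for _ in range(n_max):
        var.append(rho ** 2 * var[-1] + sigma_w2)
    return var


def variance_closed_form(rho, sigma_w2, sigma_02, n):
    """[1 + rho^2 + ... + rho^(2(n-1))] sigma_W^2 + rho^(2n) sigma_0^2."""
    geometric = sum(rho ** (2 * k) for k in range(n))
    return geometric * sigma_w2 + rho ** (2 * n) * sigma_02


if __name__ == "__main__":
    rho, sigma_w2, sigma_02 = 0.8, 1.0, 5.0
    var = variance_recursion(rho, sigma_w2, sigma_02, 200)
    print("var[10] (recursion):  ", var[10])
    print("var[10] (closed form):", variance_closed_form(rho, sigma_w2, sigma_02, 10))
    print("var[200]:", var[200], "limit:", sigma_w2 / (1.0 - rho ** 2))
```

Whatever the initial variance, the sequence settles to $\sigma_W^2/(1-\rho^2)$, which is the sense in which this Gauss Markov sequence is asymptotically stationary.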


Example 8.5-2 
(Markov difference equation) Consider the difference equation 


$$X[n] = \rho X[n-1] + W[n],$$

where $W[n]$ is an independent random sequence (cf. Definition 8.1-2). Let $n > 0$; then

$$\begin{aligned}
f_X(x_n, x_{n-1}, \ldots, x_0) &= f_X(x_n \mid x_{n-1})\,f_X(x_{n-1} \mid x_{n-2}) \cdots f_X(x_1 \mid x_0)\,f_W(x_0) \\
&= \left(\prod_{k=1}^{n} f_W(x_k - \rho x_{k-1})\right) f_W(x_0),
\end{aligned}$$

where $x[n] = x_n$ and $w[n] = w_n$ are the sample function values taken on by the random
sequences $X[n]$ and $W[n]$, respectively. Clearly $X[n]$ is a Markov random sequence. If $W[n]$
is an independent and Gaussian random sequence, then this is just the case of Example 8.5-1
above. Otherwise, the Markov sequence $X[n]$ will be non-Gaussian.


The Markov property can be generalized to cover higher-order dependence and higher- 
order difference equations, thus extending the direct dependence concept to more than one- 
sample distance. 


Definition 8.5-2 (Markov-p random sequence) Let the positive integer $p$ be called the
order of the Markov-p random sequence. A continuous-valued Markov-p random sequence
$X[n]$, defined for $n \ge 0$, satisfies the conditional pdf equations

$$f_X(x_{n+k} \mid x_n, x_{n-1}, \ldots, x_0) = f_X(x_{n+k} \mid x_n, x_{n-1}, \ldots, x_{n-p+1})$$

for all $k \ge 1$ and for all $n \ge p$.

Returning to look at Equation 8.5-1, we can see that as the Markov order $p$ increases, the
modeling error in approximating a general random sequence by a Markov random sequence
should decrease.

$$\begin{aligned}
f_X(x_0, x_1, \ldots, x_N) &= f_X(x_0)\,f_X(x_1 \mid x_0)\,f_X(x_2 \mid x_1, x_0) \cdots f_X(x_N \mid x_{N-1}, \ldots, x_0) \\
&\approx f_X(x_0)\,f_X(x_1 \mid x_0)\,f_X(x_2 \mid x_1, x_0) \cdots f_X(x_p \mid x_{p-1}, \ldots, x_0) \\
&\quad \times \prod_{k=p+1}^{N} f_X(x_k \mid x_{k-1}, \ldots, x_{k-p}).
\end{aligned}$$


This approximation would be expected to hold for the usual case where the strongest 
dependence is on the nearby values, say X[n — 1] and X[n — 2], with the conditional depen- 
dence on far away values being generally negligible. When the Markov-p model is used in 
signal processing, one of the most important issues is determining an appropriate model 
order p so that statistics like the joint pdf’s (Equation 8.5-1) of the original data are 
adequately approximated by those of the Markov-p model. In Chapter 11 on applications in 
statistical signal processing, we will see that Markov-p random sequences are quite useful in 
modern spectral estimation. The celebrated Kalman filter for the recursive linear estimation 
of distorted signals in noise is based on the Markov models. 


ARMA Models 


A class of linear constant coefficient difference equation models are called ARMA for
autoregressive moving average. Here the input is an independent random sequence $W[n]$ with
mean $\mu_W = 0$ and variance $\sigma_W^2 = 1$. The LCCDE model then takes the form

$$X[n] = \sum_{k=1}^{M} c_k X[n-k] + \sum_{k=0}^{L} d_k W[n-k].$$

If the model is BIBO stable and $-\infty < n < +\infty$, then a WSS output sequence results
with psd

$$S_{XX}(\omega) = \frac{\left|\sum_{k=0}^{L} d_k \exp(-j\omega k)\right|^2}{\left|1 - \sum_{k=1}^{M} c_k \exp(-j\omega k)\right|^2}.$$
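
As an illustration of the ARMA psd formula (squared magnitude of the moving-average polynomial divided by that of the feedback polynomial, for unit-variance input), the Python sketch below evaluates it for given coefficient lists. The function name `arma_psd` and the argument names `c` (feedback coefficients, $k = 1, \ldots, M$) and `d` (moving-average coefficients, $k = 0, \ldots, L$) are illustrative choices.

```python
import cmath
import math


def arma_psd(omega, c, d):
    """psd of a stable ARMA model driven by unit-variance white noise.

    c[k-1] multiplies X[n-k] for k = 1..M; d[k] multiplies W[n-k] for k = 0..L.
    """
    num = sum(dk * cmath.exp(-1j * omega * k) for k, dk in enumerate(d))
    den = 1.0 - sum(ck * cmath.exp(-1j * omega * k)
                    for k, ck in enumerate(c, start=1))
    return abs(num) ** 2 / abs(den) ** 2


if __name__ == "__main__":
    # AR(1) with feedback coefficient 0.5, and a two-tap moving average.
    for omega in (0.0, math.pi / 2.0, math.pi):
        print(omega, arma_psd(omega, [0.5], [1.0]), arma_psd(omega, [], [1.0, 1.0]))
```

With `c = [rho]` and `d = [1.0]` the result reduces to $1/(1 - 2\rho\cos\omega + \rho^2)$, and with `c = []` the model is pure MA.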

The ARMA sequence is not Markov, but when $L = 0$, the sequence is Markov-$M$, and
the resulting model is called autoregressive (AR). On the other hand when $M = 0$, that is,
there are no feedback coefficients $c_k$, the equation becomes just

$$X[n] = \sum_{k=0}^{L} d_k W[n-k],$$

and the model is called moving average (MA). The MA model is often used to estimate the
time-average value over a data window, as shown in the next example.

Example 8.5-3
(running time average) Consider a sequence of independent random variables $W[n]$ on
$n \ge 1$. Denote their running time average as

$$\hat{\mu}_W[n] = \frac{1}{n}\sum_{k=1}^{n} W[k].$$

Since we can write $\hat{\mu}_W[n]$ equivalently as satisfying the time-varying AR equation,

$$\hat{\mu}_W[n] = \frac{n-1}{n}\,\hat{\mu}_W[n-1] + \frac{1}{n}\,W[n],$$

it follows from the joint independence of the input $W[n]$ that $\hat{\mu}_W[n]$ is a nonstationary
Markov random sequence.†
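
The time-varying recursion for the running average, in which the new average is $(n-1)/n$ times the old average plus $1/n$ times the new sample, needs only the previous average and the current input, which is what makes the sequence Markov. A minimal Python sketch (the function name is illustrative):

```python
def running_average(samples):
    """Running time average computed by the time-varying AR recursion."""
    avg = 0.0
    out = []
    for n, w in enumerate(samples, start=1):
        # new average = (n-1)/n * old average + (1/n) * new sample
        avg = (n - 1) / n * avg + w / n
        out.append(avg)
    return out


if __name__ == "__main__":
    print(running_average([2.0, 4.0, 6.0, 8.0]))
```

For the input 2, 4, 6, 8 the running averages are 2, 3, 4, 5 (up to floating-point rounding), matching the direct definition as a sum divided by $n$.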


Markov Chains 


A Markov random sequence can take on either continuous or discrete values and then be 
represented either by probability density functions (pdf’s) or probability mass functions 
(PMFs) accordingly. In the discrete-valued case, we call the random sequence a Markov 
chain. Applications occur in buffer occupancy, computer networks, and discrete-time approx- 
imate models for the continuous-time Markov chains (cf. Chapter 9). 


Definition 8.5-3 (Markov chain) A discrete-time Markov chain is a random sequence
$X[n]$ whose Nth-order conditional PMFs satisfy

$$P_X(x[n] \mid x[n-1], \ldots, x[n-N]) = P_X(x[n] \mid x[n-1]) \tag{8.5-3}$$

for all $n$, for all values of $x[k]$, and for all integers $N > 1$.


The value of X[n] at time n is called “the state.” This is because this current value, 
that is, the value at time n, determines future conditional PMF's, independent of the past 
values taken on by X[n]. 

A practical case of great importance is when the range of values taken on by $X[n]$ is
finite, say $M$. The discrete range of $X[n]$, that is, the values that $X$ takes on, is sometimes
referred to as a set of labels. The usual choices for the label set are either the integers
$\{1, \ldots, M\}$ or $\{0, \ldots, M-1\}$. Such a Markov chain is said to have a finite state space, or is simply
a finite-state Markov chain. In this case, and when the random sequence is stationary, we
can represent the statistical transition information in a matrix $\mathbf{P}$ with entries

$$p_{ij} = P_{X[n] \mid X[n-1]}(j \mid i), \tag{8.5-4}$$

for $1 \le i, j \le M$. The matrix $\mathbf{P}$ is referred to as the state-transition matrix. Its defining
property is that it is a matrix with nonnegative entries, whose rows sum to 1. Usually, and
again without loss of generality, we can consider that the Markov chain starts at time index
$n = 0$. Then we must specify the set of initial probabilities of the states at $n = 0$, that is,
$P_X(i; 0)$, $1 \le i \le M$, which can be stored in the initial probability vector $\mathbf{p}[0]$, a row vector
with elements $(\mathbf{p}[0])_i = P_X(i; 0)$, $1 \le i \le M$.

†Note that the variance of $\hat{\mu}_W[n]$ decreases with $n$.

The following example re-introduces the useful concept of state-transition diagram, 
already seen in Example 8.1-15. 


Example 8.5-4 
(two-state Markov chain) Let M = 2; then we can summarize transition probability infor- 
mation about a two-state Markov chain in Figure 8.5-1. The only addition needed is the set 
of initial probabilities, Px (1;0) and Px (2;0). 


Possible questions might be: Given that we are in state 1 at time 4, what is the prob- 
ability we end up in state 2 at time 6? Or given a certain probability distribution over the 
two states at time 3, what is the probability distribution over the two states at time 7? 
Note that there are several ways or paths to go from one state at one time to another state 
several time units later. The answers to these questions thus will involve a summation over 
these mutually exclusive outcomes. 

Here we have $M = 2$, and the two-element probability row vector $\mathbf{p}[n] = (p_0[n], p_1[n])$.
Using the state-transition matrix, we then have

$$\begin{aligned}
\mathbf{p}[1] &= \mathbf{p}[0]\mathbf{P}, \\
\mathbf{p}[2] &= \mathbf{p}[1]\mathbf{P} = \mathbf{p}[0]\mathbf{P}^2,
\end{aligned}$$

or, in general,

$$\mathbf{p}[n] = \mathbf{p}[0]\mathbf{P}^n.$$

In a statistical steady state, if one exists, we would have

$$\mathbf{p}[\infty] = \mathbf{p}[\infty]\mathbf{P}, \quad\text{where } \mathbf{p}[\infty] \triangleq \lim_{n\to\infty}\mathbf{p}[n].$$

Writing $\mathbf{p} \triangleq \mathbf{p}[\infty]$, we have $\mathbf{p}(\mathbf{I} - \mathbf{P}) = \mathbf{0}$, which furnishes $M - 1$ independent linear
equations. Then with help of the additional equation $\mathbf{p}\mathbf{1} = 1$, where $\mathbf{1}$ is a size $M$ column
vector of all ones, we can solve for the $M$ values in $\mathbf{p}$. The existence of a steady state, or
equivalently asymptotic stationarity, will depend on the eigenvalues of the state-transition
matrix $\mathbf{P}$.

Figure 8.5-1 The state-transition diagram of a two-state Markov chain.
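
The questions posed in this example (multi-step transition probabilities and the evolution of a state distribution) all reduce to repeated multiplication of a row vector by the state-transition matrix. The Python sketch below shows this for a hypothetical two-state matrix with stay probabilities 0.9 and 0.8; the numbers and function names are assumptions chosen for illustration, and the states are indexed 0 and 1 in the code.

```python
def step(p, P):
    """One update p[n+1] = p[n] P for a row vector p and square matrix P."""
    size = len(P)
    return [sum(p[i] * P[i][j] for i in range(size)) for j in range(size)]


def propagate(p0, P, n):
    """Return p[n] = p[0] P^n by applying the one-step update n times."""
    p = list(p0)
    for _ in range(n):
        p = step(p, P)
    return p


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.2, 0.8]]  # hypothetical transition probabilities
    # Start with certainty in the first state and look two time steps ahead:
    # the two paths to the second state contribute 0.9*0.1 + 0.1*0.8 = 0.17.
    print("two-step distribution:", propagate([1.0, 0.0], P, 2))
    # A long run approximates the steady state, if one exists.
    print("after 200 steps:", propagate([0.5, 0.5], P, 200))
```

Summing over paths, as the text describes, is exactly what the matrix product does: each entry of $\mathbf{P}^2$ adds the probabilities of the mutually exclusive two-step routes.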


Example 8.5-5 
(asymmetric two-state Markov chain) Here we consider an example of a two-state, asymmetric
Markov chain (AMC), with state labels 0 and 1, and state-transition matrix,

$$\mathbf{P} = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}.$$

See Figure 8.5-2.
Note that in this model there is no requirement that $p_{00} = p_{11}$, and the steady-state
probabilities, if they exist, are given by the solution of

$$\mathbf{p}[n+1] = \mathbf{p}[n]\mathbf{P}, \tag{8.5-5}$$

if we let $n \to \infty$. Denoting these probabilities by $p_0[\infty]$ and $p_1[\infty]$, and using $p_0[\infty] + p_1[\infty] = 1$, we obtain

$$p_0[\infty] = \frac{1 - p_{11}}{2 - p_{00} - p_{11}},$$

$$p_1[\infty] = \frac{1 - p_{00}}{2 - p_{00} - p_{11}},$$

which, using the $\mathbf{P}$ matrix given above, yields $p_0[\infty] = \tfrac{2}{3}$ and $p_1[\infty] = \tfrac{1}{3}$.


The steady-state autocorrelation function of the AMC of this example can be computed
from the Markov state probabilities. For example, assuming asymptotic stationarity,

$$\begin{aligned}
R_{XX}[m] &= P\{X[k] = 1, X[m+k] = 1\} \quad\text{for sufficiently large } k \\
&= P\{X[k] = 1\}\,P\{X[m+k] = 1 \mid X[k] = 1\} \\
&= p_1[\infty]\,P\{X[m] = 1 \mid X[0] = 1\},
\end{aligned} \tag{8.5-6}$$

where the last factor is an $m$-step transition from state 1 to state 1. It can be computed
recursively from Equation 8.5-5, with the initial condition $\mathbf{p}[0] = [0, 1]$. The needed computation
can also be illustrated in a trellis diagram as seen in the following example.

Figure 8.5-2 State-transition diagram of general (asymmetric) two-state Markov random sequence,
with state labels 0 and 1.


Example 8.5-6 
(trellis diagram for Markov chain) Consider once again Example 8.1-15, where we intro- 
duced the state-transition diagram for what we now know as a Markov chain. Another 
useful diagram that shows allowable paths to reach a certain state, and the probability of 
those paths, is the trellis diagram, named for its resemblance to the common wooden garden 
trellis that supports some plants. See Figure 8.5-3 for the two-state case having labels 0 and
1, which also assumes symmetry, that is, $p_{ij} = p_{ji}$. We see that this trellis is a collapsing
of the more general tree diagram of Example 8.1-4. The collapse of the tree to the trellis is
permitted because of the Markov condition on the conditional probabilities, that serve as 
the branch labels. 

Each node represents the state at a given time instant. The node value (label) is its 
probability at time n. The links (directed branches) denote possible transitions and are 
labeled with their respective transition probabilities. Paths through the trellis then repre- 
sent allowable multiple time-step transitions, with probability given as the product of the 
transition probabilities along the path. 

If we know that the chain is in state one at time $n = 0$, then the modified trellis
diagram simplifies to that of Figure 8.5-4, where we have labeled the state 1 nodes with
$P_n \triangleq P\{X[n] = 1 \mid X[0] = 1\}$, and we can use this trellis to calculate the probabilities
$P\{X[n] = 1 \mid X[0] = 1\}$ needed in Equation 8.5-6. The first few $P_n$ values are easily calculated
as $P_1 = p$, $P_2 = p^2 + q^2$, $P_3 = p^3 + 3pq^2$, etc., where $q = 1 - p$ is the probability of switching states. For the case $p_0[\infty] = p_1[\infty] = \tfrac{1}{2}$ and $p = 0.8$,
the asymptotically stationary autocorrelation (ASA) function $R_{XX}[m]$ then becomes
$R_{XX}[0] = 0.5$, $R_{XX}[\pm 1] = 0.4$, $R_{XX}[\pm 2] = 0.34$, $R_{XX}[\pm 3] = 0.304$, and so forth.†

Figure 8.5-3 A trellis diagram of a two-state symmetric Markov chain with state labels 0 and 1. Here
$p_i[n]$ is the probability of being in state $i$ at time $n$.

Figure 8.5-4 Trellis diagram conditioned on $X[0] = 1$.
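
The return probabilities read off the conditioned trellis can also be generated recursively: at each step the chain is in state 1 either by staying there (probability $p$) or by switching from state 0 (probability $q = 1 - p$). The Python sketch below applies this recursion and scales by the steady-state probability 0.5 to reproduce the ASA values quoted in this example; the function names are illustrative.

```python
def return_probabilities(p, n_max):
    """P_n = P{X[n] = 1 | X[0] = 1} for a symmetric two-state chain, n = 0..n_max."""
    q = 1.0 - p
    probs = [1.0]  # P_0 = 1: the chain starts in state 1
    for _ in range(n_max):
        prev = probs[-1]
        # stay in state 1 with probability p, or arrive from state 0 with probability q
        probs.append(p * prev + q * (1.0 - prev))
    return probs


def asa(p, n_max, steady_state_prob=0.5):
    """Asymptotically stationary autocorrelation R_XX[m] = p_1[inf] * P_m."""
    return [steady_state_prob * pm for pm in return_probabilities(p, n_max)]


if __name__ == "__main__":
    print(asa(0.8, 3))  # approximately 0.5, 0.4, 0.34, 0.304
```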


The trellis diagram shows that, except in trivial cases, there are many allowable paths 
to reach a certain node, that is, a given state at a given time. This raises the question of 
which path is most probable (most likely) to make the required multistep traversal. In the 
previous example, and with $p > q$, it is just a matter of finding the path with the most
p’s. In general, however, finding the most likely path is a time-consuming problem and, if 
left to trial-and-error techniques, would quickly exhaust the capabilities of most computers. 
Much research has been done on this problem because of its many engineering applications, 
one being speech recognition by computer. In Chapter 11, we discuss the efficient Viterbi 
algorithm for finding the most likely path. 


Example 8.5-7 
(buffer fullness) Consider the Markov chain as a model for a communications buffer with 
M +1 states, with labels 0 to M indicating buffer fullness. In other words, the state label is 
the number of bytes currently stored in the M byte capacity buffer. Assume that transitions 
can occur only between neighboring states; that is, the fullness can change at most by one 
byte in each time unit. The state-transition diagram then appears as shown in Figure 8.5-5. 


If we let M go to infinity in Example 8.5-7, we have what is called the general birth— 
death chain, which was first used to model the size of a population over time. In each time 
unit, there can be at most one birth and at most one death. 


Figure 8.5-5 Markov chain model for $M + 1$ state communications buffer.

†The ASA is computed as $R_{XX}[m] = E\{X[k+m]X[k]\}$, where $k \to \infty$. For levels of 0 and 1, $R_{XX}[m] = P\{X[m+k] = 1 \mid X[k] = 1\} \times 0.5$. Then clearly $R_{XX}[0] = 1 \times 0.5 = 0.5$, $R_{XX}[1] = 0.8 \times 0.5 = 0.4$,
$R_{XX}[2] = [(0.8)^2 + (0.2)^2] \times 0.5 = 0.34$, etc.

Solving the equations. Consider a two-state Markov chain with transition probability
matrix

$$\mathbf{P} = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}.$$

We can write the equation relating $\mathbf{p}[n]$ and $\mathbf{p}[n+1]$ then as follows:

$$[p_0[n+1], p_1[n+1]] = [p_0[n], p_1[n]] \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}. \tag{8.5-7}$$


This vector equation is equivalent to two scalar difference equations which we have to 
solve together, that is, two simultaneous difference equations. We try a solution of the form 


$$p_0[n] = C_0 z^n, \quad p_1[n] = C_1 z^n.$$

Inserting this attempted solution into Equation 8.5-7 and canceling the common term
$z^n$, we obtain

$$C_0 z = C_0 p_{00} + C_1 p_{10},$$

$$C_1 z = C_0 p_{01} + C_1 p_{11},$$

which implies the following necessary conditions:

$$C_1 = C_0\left(\frac{z - p_{00}}{p_{10}}\right) = C_0\left(\frac{p_{01}}{z - p_{11}}\right).$$

This gives a constraint relation between the constants $C_0$ and $C_1$ as well as a necessary
condition on $z$, the latter being called the characteristic equation

$$(z - p_{00})(z - p_{11}) - p_{10}p_{01} = 0.$$

It turns out that the characteristic equation (CE) can be written, using the determinant
function, as

$$\det(z\mathbf{I} - \mathbf{P}) = 0.$$

Solving our two-state equation, we obtain just two solutions $z_1$ and $z_2$, one of which
must equal 1. (Can you see this? Note that $1 - p_{00} = p_{01}$.) The solutions we have obtained
thus far can be written as

$$p_0[n] = C_0 z_i^n, \quad p_1[n] = C_0\left(\frac{z_i - p_{00}}{p_{10}}\right) z_i^n, \quad i = 1, 2.$$


Since the vector difference equation is linear, we can add the two solutions corresponding
to the different values of $z_i$, to get the general solution, written in row vector form

$$\mathbf{p}[n] = A_1\left[1, \frac{z_1 - p_{00}}{p_{10}}\right] z_1^n + A_2\left[1, \frac{z_2 - p_{00}}{p_{10}}\right] z_2^n,$$

where we have introduced two new constants $A_1$ and $A_2$ for each of the two linearly independent
solutions. These two constants must be evaluated from the initial probability vector
$\mathbf{p}[0]$ and the necessary conditions on the probability row vector at time index $n$, that is,
$\sum_{i=0}^{1} p_i[n] = 1$ for all $n \ge 0$.

Example 8.5-8
(complete solution) Let

$$\mathbf{P} = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}, \quad\text{with } \mathbf{p}[0] = [1/2, 1/2],$$

and solve for the complete solution, including the startup transient and the steady-state
values for $\mathbf{p}[n]$.

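The first step is to find the eigenvalues of $\mathbf{P}$, which are the roots of the characteristic
equation (CE)

$$\det(z\mathbf{I} - \mathbf{P}) = \det\begin{pmatrix} z - 0.9 & -0.1 \\ -0.2 & z - 0.8 \end{pmatrix} = z^2 - 1.7z + 0.7 = 0.$$

This gives roots $z_1 = 0.7$ and $z_2 = 1.0$. Thus, we can write

$$\mathbf{p}[n] = C_1[1, -1]\,0.7^n + C_2[1, 0.5]\,1^n.$$

From the steady-state requirement that the components of $\mathbf{p}$ sum to 1.0, we get $C_2 = \tfrac{2}{3}$.
So we can further write

$$\mathbf{p}[n] = C_1[1, -1]\,0.7^n + \left[\tfrac{2}{3}, \tfrac{1}{3}\right].$$

Finally we invoke the specified initial conditions $\mathbf{p}[0] = [1/2, 1/2]$ to obtain $C_1 = -\tfrac{1}{6}$
and

$$\mathbf{p}[n] = \left[-\tfrac{1}{6}, \tfrac{1}{6}\right] 0.7^n + \left[\tfrac{2}{3}, \tfrac{1}{3}\right],$$

or in scalar form,

$$\begin{aligned}
p_0[n] &= -\tfrac{1}{6}\,0.7^n + \tfrac{2}{3}, \\
p_1[n] &= \tfrac{1}{6}\,0.7^n + \tfrac{1}{3},
\end{aligned} \quad\text{for } n \ge 0.$$

Here we see that the steady-state probabilities exist and are $p_0[\infty] = \tfrac{2}{3}$ and $p_1[\infty] = \tfrac{1}{3}$.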
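
The complete solution of this example can be cross-checked by brute force. The Python sketch below finds the roots of the characteristic equation $z^2 - (p_{00}+p_{11})z + (p_{00}p_{11} - p_{01}p_{10}) = 0$ with the quadratic formula, and compares the closed form (a decaying $0.7^n$ transient of amplitude $1/6$ around the steady state $[2/3, 1/3]$) against direct iteration of the one-step update. Function names are illustrative.

```python
import math


def characteristic_roots(P):
    """Roots of det(zI - P) = 0 for a 2x2 transition matrix, smallest first."""
    trace = P[0][0] + P[1][1]
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    disc = math.sqrt(trace * trace - 4.0 * det)
    return (trace - disc) / 2.0, (trace + disc) / 2.0


def closed_form(n):
    """Transient plus steady state for P = [[0.9, 0.1], [0.2, 0.8]], p[0] = [1/2, 1/2]."""
    transient = 0.7 ** n / 6.0
    return [-transient + 2.0 / 3.0, transient + 1.0 / 3.0]


def iterate(p0, P, n):
    """Direct iteration of p[n+1] = p[n] P for a two-state chain."""
    p = list(p0)
    for _ in range(n):
        p = [p[0] * P[0][0] + p[1] * P[1][0],
             p[0] * P[0][1] + p[1] * P[1][1]]
    return p


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.2, 0.8]]
    print("roots:", characteristic_roots(P))
    for n in (0, 1, 5, 20):
        print(n, closed_form(n), iterate([0.5, 0.5], P, n))
```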


The next example shows that such steady-state probabilities do not always exist. 


Example 8.5-9 
(ping pong) Consider the two-state Markov chain with transition probability matrix

$$\mathbf{P} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

The characteristic equation becomes

$$\det(z\mathbf{I} - \mathbf{P}) = \det\begin{pmatrix} z & -1 \\ -1 & z \end{pmatrix} = z^2 - 1 = 0,$$

with two roots $z_{1,2} = \pm 1$. Thus there is no steady state in this case, even though one
of the eigenvalues of $\mathbf{P}$ is 1. Indeed, direct from the state-transition diagram, we can see
that the random sequence will forever cycle back and forth between states 0 and 1 with
each successive time tick. The phase can be irrevocably set by the initial probability vector
$\mathbf{p}[0] = [1, 0]$.

While we cannot always assume a steady state exists, note that this example is degenerate
in that the transition probabilities into and out of states are either 0 or 1. Another
problem for existence of the steady state is a so-called trapping state. This is a state with 
transitions into, but not out of, itself. In most cases of interest in communications and signal 
processing, a steady state will exist, independent of where the chain starts. 


8.6 VECTOR RANDOM SEQUENCES AND STATE EQUATIONS 


The scalar random sequence concepts we have seen thus far can be extended to vector 
random sequences. They are used in Chapter 11 to derive linear estimators for signals in 
noise (Kalman filter). They are also used in models of sensor arrays, for example, seismic, 
acoustic, and radar. This section will introduce difference equations for random vectors and 
the concept of vector Markov random sequence. Interestingly, a high-order Markov-p scalar 
random sequence can be represented as a first-order vector Markov sequence. 


Definition 8.6-1 A vector random sequence is a mapping from a probability sample
space $\Omega$, corresponding to probability space $(\Omega, \mathcal{F}, P)$, into the space of vector-valued
sequences over complex numbers.


Thus for each $\zeta \in \Omega$ and fixed time $n$, we generate a vector $\mathbf{X}(n, \zeta)$. The vector random
sequence is usually written $\mathbf{X}[n]$, suppressing the outcome $\zeta$.
For example, the first-order CDF for a random vector sequence $\mathbf{X}[n]$ would be given as

$$F_{\mathbf{X}}(\mathbf{x}; n) \triangleq P\{\mathbf{X}[n] \le \mathbf{x}\},$$

where $\{\mathbf{X}[n] \le \mathbf{x}\}$ means every element satisfies the inequality, that is,
$\{X_1[n] \le x_1, X_2[n] \le x_2, \ldots, X_N[n] \le x_N\}$. Second- and higher-order probabilities would be specified accordingly.
The vector random sequence is said to be statistically specified by the set of all its first- and
higher-order CDFs (or pdf's or PMFs).

The following example treats the correlation analysis for a vector random sequence
input to a vector LCCDE

$$\mathbf{y}[n] = \mathbf{A}\mathbf{y}[n-1] + \mathbf{B}\mathbf{x}[n],$$

with $N$-dimensional coefficient matrices $\mathbf{A}$ and $\mathbf{B}$. In this vector case, BIBO stability is
assured when the eigenvalues of $\mathbf{A}$ are less than one in magnitude.


Example 8.6-1 
(vector LCCDEs) In the vector case, the scalar first-order LCCDE model, excited by the column
vector random sequence $\mathbf{X}[n]$, becomes

$$\mathbf{Y}[n] = \mathbf{A}\mathbf{Y}[n-1] + \mathbf{B}\mathbf{X}[n], \tag{8.6-1}$$

which is a first-order vector difference equation in the sample vector sequences. The vector
impulse response is the column vector sequence

$$\mathbf{h}[n] = \mathbf{A}^n \mathbf{B}\, u[n],$$

and the zero initial-condition response to the sequence $\mathbf{X}[n]$ is

$$\mathbf{Y}[n] = \mathbf{h}[n] * \mathbf{X}[n] = \sum_{k=-\infty}^{+\infty} \mathbf{h}[n-k]\,\mathbf{X}[k].$$

The matrix system function is

$$\mathbf{H}(z) = (\mathbf{I} - \mathbf{A}z^{-1})^{-1}\mathbf{B},$$

as can easily be verified. The WSS cross-correlation matrices $\mathbf{R}_{YX}[m] \triangleq E\{\mathbf{Y}[n+m]\mathbf{X}^{\dagger}[n]\}$
(where the "$\dagger$" indicates the Hermitian (or conjugate) transpose) and
$\mathbf{R}_{XY}[m] \triangleq E\{\mathbf{X}[n+m]\mathbf{Y}^{\dagger}[n]\}$, between an input WSS random vector sequence and its WSS output
random sequence, become

$$\mathbf{R}_{YX}[m] = \mathbf{h}[m] * \mathbf{R}_{XX}[m],$$
$$\mathbf{R}_{XY}[m] = \mathbf{R}_{XX}[m] * \mathbf{h}^{\dagger}[-m].$$

Parenthetically, we note that for a causal $\mathbf{h}$, such as would arise from recursive solution
of the above vector LCCDE, we have the output $\mathbf{Y}[n]$ uncorrelated with the future values
of the input $\mathbf{X}[n]$, when the input $\mathbf{X} = \mathbf{W}$ is assumed a white noise vector sequence.

The output correlation matrix is

$$\mathbf{R}_{YY}[m] = \mathbf{h}[m] * \mathbf{R}_{XX}[m] * \mathbf{h}^{\dagger}[-m]$$

and the output psd matrix becomes upon Fourier transformation

$$\mathbf{S}_{YY}(\omega) = \mathbf{H}(\omega)\mathbf{S}_{XX}(\omega)\mathbf{H}^{\dagger}(\omega).$$

The total solution of Equation 8.6-1 for any $n > n_0$ can be written as

$$\mathbf{Y}[n] = \mathbf{A}^{n-n_0}\mathbf{Y}[n_0] + \sum_{k=n_0+1}^{n} \mathbf{h}[n-k]\,\mathbf{X}[k], \qquad n > n_0,$$

in terms of the initial condition $\mathbf{Y}[n_0]$ that must be specified at $n_0$. In the limit as $n_0 \to -\infty$,
and for a stable system matrix $\mathbf{A}$, this then becomes the convolution summation

$$\mathbf{Y}[n] = \mathbf{h}[n] * \mathbf{X}[n], \qquad -\infty < n < +\infty.$$
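
The equivalence between the recursive solution and the convolution with h[n] = A^n B u[n] can be checked numerically. The sketch below assumes a zero initial condition, an input that starts at n = 0, and small 2-by-2 matrices chosen arbitrarily with eigenvalues inside the unit circle; all function names are illustrative.

```python
import random


def mat_vec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def mat_mul(A, B):
    """Matrix-matrix product A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def vec_add(x, y):
    return [a + b for a, b in zip(x, y)]


def impulse_response(A, B, length):
    """Return [h[0], ..., h[length-1]] with h[n] = A^n B."""
    size = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    h = []
    for _ in range(length):
        h.append(mat_mul(power, B))
        power = mat_mul(power, A)
    return h


def recursive_response(A, B, x):
    """Run Y[n] = A Y[n-1] + B X[n] from a zero initial condition."""
    y_prev, out = [0.0] * len(A), []
    for xn in x:
        y_prev = vec_add(mat_vec(A, y_prev), mat_vec(B, xn))
        out.append(y_prev)
    return out


def convolution_response(h, x):
    """Evaluate Y[n] = sum_{k=0}^{n} h[n-k] X[k]."""
    out = []
    for n in range(len(x)):
        acc = [0.0] * len(h[0])
        for k in range(n + 1):
            acc = vec_add(acc, mat_vec(h[n - k], x[k]))
        out.append(acc)
    return out


random.seed(0)
A = [[0.5, 0.2], [0.0, 0.3]]   # assumed stable: eigenvalues 0.5 and 0.3
B = [[1.0, 0.0], [0.5, 1.0]]   # assumed input matrix
x = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(30)]
y_rec = recursive_response(A, B, x)
y_conv = convolution_response(impulse_response(A, B, len(x)), x)
max_diff = max(abs(a - b) for yr, yc in zip(y_rec, y_conv) for a, b in zip(yr, yc))
```

The two responses should agree to within floating-point rounding, which is what `max_diff` measures.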


Definition 8.6-2 A vector random sequence $\mathbf{Y}[n]$ is vector Markov if for all $K > 0$
and for all $n_K > n_{K-1} > \ldots > n_1$, we have

$$P\{\mathbf{Y}[n_K] \le \mathbf{y}_K \mid \mathbf{y}[n_{K-1}], \ldots, \mathbf{y}[n_1]\} = P\{\mathbf{Y}[n_K] \le \mathbf{y}_K \mid \mathbf{y}[n_{K-1}]\}$$

for all real values of the vector $\mathbf{y}_K$, and all conditioning vectors $\mathbf{y}[n_{K-1}], \ldots, \mathbf{y}[n_1]$. (cf.
Definition 8.5-2 of the Markov-p property.)


We can now state the following theorem for vector random sequences: 
Theorem 8.6-1 In the state equation

$$\mathbf{X}[n] = \mathbf{A}\mathbf{X}[n-1] + \mathbf{B}\mathbf{W}[n], \quad \text{for } n > 0, \text{ with } \mathbf{X}[0] = \mathbf{0},$$

let the input $\mathbf{W}[n]$ be a white Gaussian random sequence. Then the output $\mathbf{X}[n]$ for $n > 0$
is a vector Markov random sequence.


The proof is left to the reader as an exercise.


Example 8.6-2 
(relation between scalar Markov-p and vector Markov) Let $X[n]$ be a Markov-p random
sequence satisfying the pth order difference equation

$$X[n] = a_1 X[n-1] + \ldots + a_p X[n-p] + bW[n].$$

Defining the p-dimensional vector random sequence $\mathbf{X}[n] = [X[n], \ldots, X[n-p+1]]^T$,
and coefficient matrix

$$\mathbf{A} = \begin{bmatrix} a_1 & a_2 & \cdots & a_{p-1} & a_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},$$

we have

$$\mathbf{X}[n] = \mathbf{A}\mathbf{X}[n-1] + \mathbf{b}W[n].$$

Thus $\mathbf{X}[n]$ is a vector Markov random sequence with $\mathbf{b} = [b, 0, \ldots, 0]^T$. Such a vector
transformation of a scalar equation is called a state-variable representation [8-7].
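
The state-variable construction can be verified on sample sequences. This is a minimal sketch, assuming zero initial conditions and an arbitrary third-order coefficient set; `companion`, `scalar_ar`, and `vector_state` are names introduced here for illustration only.

```python
import random


def companion(a):
    """Build the p-by-p matrix with first row a_1..a_p and ones on the subdiagonal."""
    p = len(a)
    rows = [list(a)]
    for i in range(1, p):
        rows.append([1.0 if j == i - 1 else 0.0 for j in range(p)])
    return rows


def scalar_ar(a, b, w):
    """Run X[n] = a_1 X[n-1] + ... + a_p X[n-p] + b W[n] from zero initial conditions."""
    past = [0.0] * len(a)          # holds X[n-1], ..., X[n-p]
    out = []
    for wn in w:
        xn = sum(ai * xi for ai, xi in zip(a, past)) + b * wn
        out.append(xn)
        past = [xn] + past[:-1]
    return out


def vector_state(a, b, w):
    """Run the first-order vector form and return its first component."""
    A = companion(a)
    state = [0.0] * len(a)
    out = []
    for wn in w:
        state = [sum(r * s for r, s in zip(row, state)) for row in A]
        state[0] += b * wn         # the vector b = [b, 0, ..., 0]^T drives only the first entry
        out.append(state[0])
    return out


random.seed(1)
a = [0.5, -0.3, 0.1]               # assumed coefficients (sum of magnitudes below one)
b = 2.0
w = [random.gauss(0.0, 1.0) for _ in range(50)]
x_scalar = scalar_ar(a, b, w)
x_vector = vector_state(a, b, w)
max_gap = max(abs(u - v) for u, v in zip(x_scalar, x_vector))
```

The first component of the vector recursion reproduces the scalar sequence, so `max_gap` should be at rounding level.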


8.7 CONVERGENCE OF RANDOM SEQUENCES 


Some nonstationary random sequences may converge to a limit as the sequence index goes
to infinity, for example as time becomes infinite. This asymptotic behavior is evidenced
in probability theory by convergence of the fraction of successes in an infinite Bernoulli
sequence, where the relevant theorems are called the laws of large numbers. Also, when
we study the convergence of random processes in Chapter 10 we will sometimes make a
sequence of finer and finer approximations to the output of a random system at a given
time, say $t_0$, that is, $Y_n(t_0)$. The index $n$ then defines a random sequence, which should
converge in some sense to the true output. In this section we will look at several types of
convergence for random sequences, that is, sequences of random variables.

We start by reviewing the concept of convergence for deterministic sequences. Let $x_n$
be a sequence of complex (or real) numbers; then convergence is defined as follows.


Definition 8.7-1 A sequence of complex (or real) numbers $x_n$ converges to the
complex (or real) number $x$ if given any $\varepsilon > 0$, there exists an integer $n_0$ such that whenever
$n > n_0$, we have

$$|x_n - x| < \varepsilon.$$


Note that in this definition the value $n_0$ may depend on the value $\varepsilon$; that is, when $\varepsilon$
is made smaller, most likely $n_0$ will need to be made larger. Sometimes this dependence is
formalized by writing $n_0(\varepsilon)$ in place of $n_0$ in this definition. Convergence is often written as

$$\lim_{n \to \infty} x_n = x \quad \text{or as} \quad x_n \to x \ \text{ as } n \to \infty.$$


A practical problem with this definition is that one must have the limit x to test 
for convergence. For simple cases one can often guess what the limit is and then use the 
definition to verify that this limit indeed exists. Fortunately, for more complex situations 
there is an alternative in the Cauchy criterion for convergence, which we state as a theorem 
without proof. 


Theorem 8.7-1 (Cauchy criterion [8-8]) A sequence of complex (or real) numbers
$x_n$ converges to a limit if and only if (iff)

$$|x_n - x_m| \to 0 \quad \text{as both } n \text{ and } m \to \infty.$$


The reason that this works for complex (or real) numbers is that the set of all complex (or
real) numbers is complete, meaning that it contains all its limit points. For example, the
set $\{0 < x < 1\} = (0,1)$ is not complete, but the set $\{0 \le x \le 1\} = [0,1]$ is complete,
because sequences $x_n$ in these sets that tend to 0 or 1 have a limit point in the set $[0,1]$
but have no limit point in the set $(0,1)$. In fact, not only the set of all complex (or real) numbers
but also the n-dimensional linear vector spaces over both the real and complex
number fields are complete. Thus the Cauchy criterion for convergence applies in these cases also. For
more on numerical convergence see [8-8].

Convergence for functions is defined using the concept of convergence of sequences of
numbers. We say the sequence of functions $f_n(x)$ converges to the function $f(x)$ if the
corresponding sequence of numbers converges for each $x$. It is stated more formally in the
following definition.


Definition 8.7-2 The sequence of functions $f_n(x)$ converges (pointwise) to the function
$f(x)$ if for each $x_0$ the sequence of complex numbers $f_n(x_0)$ converges to $f(x_0)$.


The Cauchy criterion for convergence applies to pointwise convergence of functions
if the set of functions under consideration is complete. The set of continuous functions
is not complete because a sequence of continuous functions may converge to a discontinuous
function (cf. item (d) in Example 8.7-1). However, the set of bounded functions is
complete [8-8].

The following are some examples of convergent sequences of numbers and functions. 
We leave the demonstration of these results as exercises for the reader. 
Example 8.7-1
(some convergent sequences)


(a) $x_n = (1 - 1/n)a + (1/n)b \to a$ as $n \to \infty$.

(b) $x_n = \sin(\omega + e^{-n}) \to \sin\omega$ as $n \to \infty$.

(c) $f_n(x) = \sin[(\omega + 1/n)x] \to \sin(\omega x)$ as $n \to \infty$ for any (fixed) $x$.

(d) $f_n(x) = \begin{cases} e^{-n^2 x}, & x > 0 \\ 1, & x \le 0 \end{cases} \ \to u(-x)$ as $n \to \infty$ for any (fixed) $x$.

The reader should note that in the convergence of the functions in (c) and (d), the variable x 
is held constant as the limit is being taken. The limit is then repeated for each such x value 
to find the limiting function. 


Since a random variable is a function, a sequence of random variables (also called a 
random sequence) is a sequence of functions. Thus, we can define the first and strongest 
type of convergence for random variables. 


Definition 8.7-3 (sure convergence) The random sequence $X[n]$ converges surely to
the random variable $X$ if the sequence of functions $X[n, \zeta]$ converges to the function $X(\zeta)$
as $n \to \infty$ for all outcomes $\zeta \in \Omega$.


As a reminder, the functions $X(\zeta)$ are not arbitrary. They are random variables and
thus satisfy the condition that the set $\{\zeta : X(\zeta) \le x\} \in \mathcal{F}$ for all $x$, that is, that this set
be an event for all values of $x$. This is in fact necessary for the calculation of probability
since the probability measure $P$ is defined only for events. Such functions $X$ are more
generally called measurable functions and in a course on real analysis it is shown that the
space of measurable functions is complete [8-1]. If we have a Cauchy sequence of measurable
functions (random variables), then one can show that the limit function exists and is also
measurable (a random variable). Thus, the Cauchy convergence criterion also applies for
random variables.

Most of the time we are not interested in precisely defining random variables for sets
in $\Omega$ of probability zero because it is thought that these events will never occur. In this
case, we can weaken the concept of sure convergence to the still very strong concept of
almost-sure convergence.


Definition 8.7-4 (almost-sure convergence) The random sequence $X[n]$ converges
almost surely to the random variable $X$ if the sequence of functions $X[n, \zeta]$ converges for
all outcomes $\zeta \in \Omega$ except possibly on a set of probability zero.


This is the strongest type of convergence normally used in probability theory. It is also
called probability-1 convergence. It is sometimes written

$$P\left\{\lim_{n \to \infty} X[n, \zeta] = X(\zeta)\right\} = 1,$$

meaning simply that there is a set $A$ such that $P[A] = 1$ and $X[n]$ converges to $X$ for all
$\zeta \in A$. In particular $A \triangleq \{\zeta : \lim_{n \to \infty} X[n, \zeta] = X(\zeta)\}$. Here the set $A^c$ is the probability-zero
set mentioned in this definition. As shorthand notation we also use

$$X[n] \to X \ \text{ a.s.} \quad \text{and} \quad X[n] \to X \ \text{ pr. 1},$$

where the abbreviation "a.s." stands for almost surely, and "pr. 1" stands for probability 1.

An example of probability-1 convergence is the Strong Law of Large Numbers to be 
proved in the next section. Three examples of random sequences are next evaluated for 
possible convergence. 


Example 8.7-2 
(convergence of random sequences) For each of the following three random sequences, we
assume that the probability space $(\Omega, \mathcal{F}, P)$ has sample space $\Omega = [0,1]$, that $\mathcal{F}$ is the family
of Borel subsets of $\Omega$, and that the probability measure $P$ is Lebesgue measure, which on a real
interval $(a, b]$ is just its length $l$, that is,

$$l(a, b] \triangleq b - a \quad \text{for } b \ge a.$$


(a) $X[n, \zeta] = n\zeta$.
(b) $X[n, \zeta] = \sin(n\zeta)$.
(c) $X[n, \zeta] = \exp[-n^2(\zeta - 1/n)]$.


The sequence in (a) clearly diverges to $+\infty$ for any $\zeta \ne 0$. Thus this random sequence
does not converge. The sequence in (b) does not diverge, but it oscillates between $-1$ and
$+1$ except for the one point $\zeta = 0$. Thus this random sequence does not converge either.

Considering the random sequence in (c), the graph in Figure 8.7-1 shows that this
sequence converges as follows:

$$\lim_{n \to \infty} X[n, \zeta] = \begin{cases} \infty & \text{for } \zeta = 0 \\ 0 & \text{for } \zeta > 0. \end{cases}$$


Figure 8.7-1 Plot of sequence (c) $X[n, \zeta]$ versus $\zeta$ for $\Omega = [0,1]$ for $n = 1, \ldots, 4$.

Thus, we can say that the random sequence converges to the (degenerate) random
variable $X = 0$ with probability 1. We simply take $A = (0, 1]$ and note that $P[A] = 1$ and
that $X[n, \zeta] \approx 0$ for every $\zeta$ in $A$ for sufficiently large $n$. We write $X[n] \to 0$ a.s. However,
$X[n]$ clearly does not converge surely to zero.
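
The almost-sure behavior of sequence (c) can be tabulated directly. The sketch below evaluates X[n, ζ] = exp[-n²(ζ - 1/n)] at a few outcomes; the chosen outcomes and indices are arbitrary, and the function name `x_c` is introduced only for this illustration.

```python
import math


def x_c(n, zeta):
    """Sample value X[n, zeta] = exp(-n^2 (zeta - 1/n)) for n >= 1."""
    return math.exp(-n * n * (zeta - 1.0 / n))


outcomes = [0.0, 0.01, 0.1, 0.5, 1.0]     # a few points of the sample space [0, 1]
indices = [1, 2, 5, 10, 50, 200]
table = {zeta: [x_c(n, zeta) for n in indices] for zeta in outcomes}

if __name__ == "__main__":
    for zeta in outcomes:
        print(zeta, ["%.3e" % value for value in table[zeta]])
```

For every ζ > 0 the sample sequence eventually decays to zero, but the smaller ζ is, the longer the sequence grows before it decays (for ζ = 0.01 it is still large at n = 50). At ζ = 0 the sequence is exp(n) and diverges, which is why convergence is almost sure but not sure.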


Thus far we have been discussing pointwise convergence of sequences of functions and
random sequences. This is similar to considering a space of bounded functions $\mathcal{B}$ with the
norm†

$$\|f\|_{\infty} \triangleq \sup_x |f(x)|.$$

When we write $f_n \to f$ in the function space $\mathcal{B}$, we mean that
$\|f_n - f\|_{\infty} = \sup_x |f_n(x) - f(x)| \to 0$, which gives uniform, and therefore pointwise, convergence.
The space of continuous bounded functions is denoted $L_{\infty}$ and is known to be complete ([8-1], p. 115).

†The supremum or sup operator is similar to the max operator. The supremum of a set of numbers is
the smallest number greater than or equal to each number in the set, for example, $\sup\{0 < x < 1\} = 1$.
Note the difficulty with max in this example since 1 is not included in the open interval $(0,1)$; thus the max
does not exist here!

Another type of function space of great practical interest uses the energy norm (cf.
Equation 4.4-6):

$$\|f\|_2 \triangleq \left( \int_{-\infty}^{+\infty} |f(x)|^2 \, dx \right)^{1/2}.$$

The space of integrable (measurable) functions with finite energy norm is denoted $L^2$. When
we say a sequence of functions converges in $L^2$, that is, $\|f_n - f\|_2 \to 0$, we mean that

$$\left( \int_{-\infty}^{+\infty} |f_n(x) - f(x)|^2 \, dx \right)^{1/2} \to 0 \quad \text{as } n \to \infty.$$

This space of integrable functions is also complete [8-1]. A corresponding concept for random
sequences is given by mean-square convergence.


Definition 8.7-5 (mean-square convergence) A random sequence $X[n]$ converges in
the mean-square sense to the random variable $X$ if $E\{|X[n] - X|^2\} \to 0$ as $n \to \infty$.


This type of convergence depends only on the second-order properties of the random
variables and is thus often easier to calculate than a.s. convergence. A second benefit of the
mean-square type of convergence is that it is closely related to the physical concept of power.
If $X[n]$ converges to $X$ in the mean-square sense, then we can expect that the variance of
the error $\varepsilon[n] \triangleq X[n] - X$ will be small for large $n$. If we look back at Example 8.7-2, (c),
we can see that this random sequence does not converge in the mean-square sense, so that
the error variance or power as defined here would not ever be expected to be small. To see
this, consider possible mean-square convergence to zero (since $X[n] \to 0$ a.s.),

$$\begin{aligned}
E\{|X[n] - 0|^2\} &= E\{X[n]^2\} \\
&= \int_0^1 \exp(-2n^2\zeta)\exp(2n)\, d\zeta \\
&= \exp(2n) \int_0^1 \exp(-2n^2\zeta)\, d\zeta \\
&= \exp(2n) \left[ \frac{1 - \exp(-2n^2)}{2n^2} \right] \to \infty \quad \text{as } n \to \infty.
\end{aligned}$$

Hence $X[n]$ does not converge in the mean-square sense to 0.
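
The growth of the second moment can be confirmed numerically. The sketch below compares the closed form exp(2n)[1 - exp(-2n²)]/(2n²) with a midpoint-rule approximation of the integral over [0, 1]; the grid size and the function names are assumptions made for this illustration.

```python
import math


def second_moment_closed_form(n):
    """exp(2n) * (1 - exp(-2 n^2)) / (2 n^2)."""
    return math.exp(2 * n) * (1.0 - math.exp(-2 * n * n)) / (2 * n * n)


def second_moment_numeric(n, num_cells=100_000):
    """Midpoint-rule estimate of the integral of exp(2n - 2 n^2 zeta) over [0, 1]."""
    h = 1.0 / num_cells
    return h * math.fsum(
        math.exp(2 * n - 2 * n * n * (i + 0.5) * h) for i in range(num_cells)
    )


moments = [
    (n, second_moment_closed_form(n), second_moment_numeric(n)) for n in range(1, 6)
]

if __name__ == "__main__":
    for n, exact, approx in moments:
        print(n, exact, approx)
```

Both columns increase with n, in agreement with the conclusion that the sequence does not converge to zero in the mean-square sense even though it converges almost surely.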

Still another type of convergence that we will consider is called convergence in probability.
It is weaker than probability-1 convergence and also weaker than mean-square convergence.
This is the type of convergence displayed in the Weak Laws of Large Numbers to be
discussed in the next section. It is defined as follows:


Definition 8.7-6 (convergence in probability) Given the random sequence $X[n]$ and
the limiting random variable $X$, we say that $X[n]$ converges in probability to $X$ if for every
$\varepsilon > 0$,

$$\lim_{n \to \infty} P[|X[n] - X| > \varepsilon] = 0.$$


We sometimes write $X[n] \to X \ (p)$, where $(p)$ denotes the type of convergence. Also convergence
in probability is sometimes called p-convergence.


One can use Chebyshev's inequality (Theorem 4.4-1), $P[|Y| > \varepsilon] \le E[|Y|^2]/\varepsilon^2$ for $\varepsilon > 0$,
to show that mean-square convergence implies convergence in probability. For example, let
$Y \triangleq X[n] - X$; then the preceding inequality becomes

$$P[|X[n] - X| > \varepsilon] \le E[|X[n] - X|^2]/\varepsilon^2.$$


Now mean-square convergence implies that the right-hand side goes to zero as $n \to \infty$, for
any fixed $\varepsilon > 0$, which implies that the left-hand side must also go to zero, which is the
definition of convergence in probability. Thus we have proved the following result.


Theorem 8.7-2 Convergence of a random sequence in the mean-square sense implies
convergence in probability.


The relation between convergence with probability 1 and convergence in probability is 
more subtle. The main difference between them can be seen by noting that the former talks 
about the probability of the limit while the latter talks about the limit of the probability. 
Further insight can be gained by noting that a.s. convergence is concerned with convergence 
of the entire sample sequences while p-convergence is concerned only with the convergence 
of the random variable at an individual n. That is to say, a.s. convergence is concerned with 
the joint events at an infinite number of times, while p-convergence is concerned with the 
simple event at time n, albeit large. One can prove the following theorem. 


Theorem 8.7-3 Convergence with probability 1 implies convergence in probability.

Proof (adapted from Gnedenko [8-9].) Let $X[n] \to X$ a.s. and define the set $A$,

$$A \triangleq \bigcap_{k=1}^{\infty} \bigcup_{n=1}^{\infty} \bigcap_{m=1}^{\infty} \left\{ \zeta : |X[n+m, \zeta] - X(\zeta)| < 1/k \right\}.$$


Then it must be that $P[A] = 1$. To see this we note that $A$ is the set of $\zeta$ such that starting
at some $n$ and for all later $n$ we have $|X[n, \zeta] - X(\zeta)| < 1/k$ and furthermore this must hold
for all $k > 0$. Thus, $A$ is precisely the set of $\zeta$ on which $X[n, \zeta]$ is convergent. So $P[A]$ must
be 1. Eventually for $n$ large enough and $1/k$ small enough we get $|X[n, \zeta] - X(\zeta)| < \varepsilon$, and
the error stays this small for all larger $n$. Thus,

$$P\left[ \bigcup_{n=1}^{\infty} \bigcap_{m=1}^{\infty} \{|X[n+m] - X| < \varepsilon\} \right] = 1 \quad \text{for all } \varepsilon > 0,$$

which implies by the continuity of probability,

$$\lim_{n \to \infty} P\left[ \bigcap_{m=1}^{\infty} \{|X[n+m] - X| < \varepsilon\} \right] = 1 \quad \text{for all } \varepsilon > 0,$$

which in turn implies the greatly weakened result

$$\lim_{n \to \infty} P[|X[n+m] - X| < \varepsilon] = 1 \quad \text{for all } \varepsilon > 0, \tag{8.7-1}$$

which is equivalent to the definition of p-convergence.
Because of the gross weakening of the a.s. condition, that is, the enlargement of the set $A$
in the foregoing proof, it can be seen that p-convergence does not imply a.s. convergence.
We note in particular that Equation 8.7-1 may well be true even though no single sample
sequence stays close to $X$ for all $n + m > n$. This is in fact the key difference between these
two types of convergence.


Example 8.7-3 
(a convergent random sequence?) Define a random pulse sequence $X[n]$ on $n \ge 0$ as follows:
Set $X[0] = 1$. Then for the next two points set exactly one of the $X[n]$'s to 1, equally
likely among the two points, and the other to 0. For the next three points set exactly one
of the $X[n]$'s to 1 equally likely among the three points and set the others to 0. Continue
this procedure for the next four points, setting exactly one of the $X[n]$'s to 1 equally likely
among the four points and the others to 0 and so forth. A sample function would look like
Figure 8.7-2.

Obviously this random sequence is slowly converging to zero in some sense as $n \to \infty$.
In fact a simple calculation would show p-convergence and also mean-square convergence
due to the growing distance between pulses as $n \to \infty$. In fact at $n \approx \frac{1}{2}l^2$, the probability
of a one (pulse) is only $1/l$. However, we do not have a.s. convergence, since every sample
sequence has ones appearing arbitrarily far out on the $n$ axis. Thus no sample sequences
converge to zero.
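
The construction of this pulse sequence can be simulated. The sketch below is a minimal illustration under the stated rule (block l has l points and exactly one pulse, placed uniformly); the names `pulse_sequence` and `block_start`, the number of blocks, and the trial count are assumptions of the sketch.

```python
import random


def pulse_sequence(num_blocks, rng):
    """Concatenate blocks of lengths 1, 2, ..., num_blocks, each holding exactly one pulse."""
    seq = []
    for length in range(1, num_blocks + 1):
        hit = rng.randrange(length)            # pulse position, uniform within the block
        seq.extend(1 if i == hit else 0 for i in range(length))
    return seq


def block_start(length):
    """Index of the first point of the block of the given length: l(l-1)/2."""
    return length * (length - 1) // 2


rng = random.Random(0)
num_blocks = 12
trials = 20_000
index = block_start(10) + 3                    # a fixed time inside the block of length 10
hits = 0
ones_in_last_block = 0
for _ in range(trials):
    seq = pulse_sequence(num_blocks, rng)
    hits += seq[index]
    ones_in_last_block += sum(seq[block_start(num_blocks):])
freq = hits / trials
```

The relative frequency `freq` of a pulse at a fixed time in block 10 should be close to 1/10, so the probability of a one shrinks with n. At the same time every simulated path still has a pulse in its last block, which is why no sample sequence converges to zero.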


Figure 8.7-2 A sequence that is converging in probability but not with probability 1.


Figure 8.7-3 Venn diagram illustrating relationships of various possible convergence modes for random
sequences. (Legend: s, surely; a.s., almost surely; m.s., mean square; p, probability; d, distribution.)


One final type of convergence that we consider is not a convergence for random variables 
at all! Rather it is a type of convergence for distribution functions. 


Definition 8.7-7 A random sequence $X[n]$ with CDF $F_n(x)$ converges in distribution
to the random variable $X$ with CDF $F(x)$ if

$$\lim_{n \to \infty} F_n(x) = F(x)$$

at all $x$ for which $F$ is continuous.


Note that in this definition we are not really saying anything about the random variables
themselves, just their CDFs. Convergence in distribution just means that as $n$ gets large the
CDFs are converging or becoming alike. For example, the sequence $X[n]$ and the variable
$X$ can be jointly independent even though $X[n]$ converges to $X$ in distribution. This is
radically different from the four earlier types of convergence, where as $n$ gets large the
random variables $X[n]$ and $X$ are becoming very dependent because some type of "error"
between them is going to zero. Convergence in distribution is the type of convergence that
occurs in the Central Limit Theorem (see Section 4.7). The relationships between these five
types of convergence are shown diagrammatically in Figure 8.7-3, where we have used the
fact that p-convergence implies convergence in distribution, which is shown below. Note that
even sure convergence may not imply mean-square convergence. This is because the integral
of the square of the limiting random variable, with respect to the probability measure, may
diverge.
To see that p-convergence implies convergence in distribution, assume that the limiting
random variable $X$ is continuous so that it has a pdf. First we consider the conditional
distribution function

$$F_{X[n]|X}(y|x) = P\{X[n] \le y \mid X = x\}.$$


From the definition of p-convergence, it should be clear that

$$F_{X[n]|X}(y|x) \to \begin{cases} 1, & y > x \\ 0, & y < x \end{cases} \quad \text{as } n \to \infty,$$

so that

$$F_{X[n]|X}(y|x) \to u(y - x), \quad \text{except possibly at the one point } y = x,$$

and hence

$$\begin{aligned}
F_{X[n]}(y) = P\{X[n] \le y\} &= \int_{-\infty}^{+\infty} F_{X[n]|X}(y|x) f_X(x)\, dx \\
&\to \int_{-\infty}^{+\infty} u(y - x) f_X(x)\, dx \\
&= F_X(y),
\end{aligned}$$

as was to be shown. In the case where the limiting random variable $X$ is not continuous,
we must exercise more care but the result is still true at all points $x$ for which $F_X(x)$ is
continuous. (See Problem 8.54.)


8.8 LAWS OF LARGE NUMBERS 


The Laws of Large Numbers have to do with the convergence of a sequence of estimates 
of the mean of a random variable. As such they concern the convergence of a random 
sequence to a constant. The Weak Laws obtain convergence in probability, while the Strong 
Laws yield convergence with probability 1. A version of the Weak Law has already been 
demonstrated in Example 4.4-3. We restate it here for convenience. 


Theorem 8.8-1 (Weak Law of Large Numbers) Let $X[n]$ be an independent random
sequence with mean $\mu_X$ and variance $\sigma_X^2$ defined for $n \ge 1$. Define another random
sequence as

$$\hat{\mu}_X[n] \triangleq (1/n) \sum_{k=1}^{n} X[k] \quad \text{for } n \ge 1.$$


Then $\hat{\mu}_X[n] \to \mu_X \ (p)$ as $n \to \infty$.
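
The Weak Law can be illustrated by simulation together with the Chebyshev bound $\sigma_X^2/(n\varepsilon^2)$ on the exceedance probability. The sketch below assumes equally likely 0/1 samples (mean 1/2, variance 1/4), a tolerance of 0.1, and 2000 independent runs per sample size; these choices and the function name are assumptions of the illustration.

```python
import random


def exceed_fraction(n, eps, trials, rng, p=0.5):
    """Fraction of runs in which the n-point sample mean misses p by more than eps."""
    count = 0
    for _ in range(trials):
        mean = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(mean - p) > eps:
            count += 1
    return count / trials


rng = random.Random(0)
eps, variance = 0.1, 0.25
results = {}
for n in (10, 100, 1000):
    empirical = exceed_fraction(n, eps, 2000, rng)
    chebyshev = min(1.0, variance / (n * eps * eps))   # bound on P[|mean - p| > eps]
    results[n] = (empirical, chebyshev)

if __name__ == "__main__":
    for n, (empirical, chebyshev) in results.items():
        print(n, empirical, chebyshev)
```

The empirical fraction falls toward zero as n grows and stays below the Chebyshev bound, which is loose but sufficient to establish convergence in probability.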
Remember, an independent random sequence is one whose terms are all jointly independent.
Another version of the Weak Law allows the random sequence to be of nonuniform
variance.


Theorem 8.8-2 (Weak Law: nonuniform variance) Let $X[n]$ be an independent
random sequence with constant mean $\mu_X$ and variance $\sigma_X^2[n]$ defined for $n \ge 1$. Then if

$$\sum_{n=1}^{\infty} \sigma_X^2[n]/n^2 < \infty,$$

$$\hat{\mu}_X[n] \to \mu_X \ (p) \quad \text{as } n \to \infty.$$


Both of these theorems are also true for convergence with probability 1, in which case 
they become Strong Laws. The theorems concerning convergence with probability 1 are 
best derived using the concept of a Martingale sequence. By introducing this concept we 
can also get another useful result called the Martingale convergence theorem, which is 
helpful in estimation and decision/detection theory. 


Definition 8.8-1† A random sequence $X[n]$ defined for $n \ge 0$ is called a Martingale
if the conditional expectation

$$E\{X[n] \mid X[n-1], X[n-2], \ldots, X[0]\} = X[n-1] \quad \text{for all } n \ge 1.$$


Viewing the conditional expectation as an estimate of the present value of the sequence 
based on the past, then for a Martingale this estimate is just the most recent past value. If 
we interpret X[n] as an amount of capital in a betting game, then the Martingale condition 
can be regarded as necessary for fairness of the game, which in fact is how it was first 
introduced [8-1]. 


Example 8.8-1 
(binomial counting sequence) Let $W[n]$ be a Bernoulli random sequence taking values $\pm 1$
with equal probability and defined for $n \ge 0$. Let $X[n]$ be the corresponding Binomial
counting sequence

$$X[n] \triangleq \sum_{k=0}^{n} W[k], \quad n \ge 0.$$


Then $X[n]$ is a Martingale, which can be shown as follows:

$$\begin{aligned}
E\{X[n] \mid X[n-1], \ldots, X[0]\} &= E\left\{ \sum_{k=0}^{n} W[k] \,\Big|\, X[n-1], \ldots, X[0] \right\} \\
&= \sum_{k=0}^{n} E\{W[k] \mid X[n-1], \ldots, X[0]\} \\
&= \sum_{k=0}^{n} E\{W[k] \mid W[n-1], \ldots, W[0]\} \\
&= \sum_{k=0}^{n-1} W[k] + E\{W[n]\} \\
&= X[n-1].
\end{aligned}$$

The first equality follows from the definition of $X[n]$. The third equality follows from the
fact that knowledge of the first $(n-1)$ $X$s is equivalent to knowledge of the first $(n-1)$
$W$s. The next-to-last equality follows from $E[W|W] = W$ and the independence of $W[n]$ from
the earlier $W[k]$. The last equality follows from the fact that $E\{W[n]\} = 0$.

†The material dealing with Martingale sequences can be omitted on a first reading.


Example 8.8-2 
(independent-increments sequences) Let $X[n]$ be an independent-increments random
sequence (see Definition 8.1-4) defined for $n \ge 0$. Then $X_c[n] \triangleq X[n] - \mu_X[n]$ is a Martingale.
To show this we write $X_c[n] = (X_c[n] - X_c[n-1]) + X_c[n-1]$ and note that by
independent increments and the fact that the mean of $X_c$ is zero, we have

$$\begin{aligned}
E\{X_c[n] \mid X_c[n-1], \ldots, X_c[0]\} &= E\{X_c[n] - X_c[n-1] \mid X_c[n-1], \ldots, X_c[0]\} \\
&\quad + E\{X_c[n-1] \mid X_c[n-1], \ldots, X_c[0]\} \\
&= E\{X_c[n] - X_c[n-1]\} + X_c[n-1] \\
&= X_c[n-1].
\end{aligned}$$
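
For the ±1 counting sequence of Example 8.8-1 the Martingale condition can be verified exactly by enumeration rather than by simulation. The sketch below lists all equally likely sign patterns W[0], ..., W[n], groups them by the observed past X[0], ..., X[n-1], and averages X[n] within each group; the function names are illustrative.

```python
from itertools import product


def conditional_means(n):
    """Map each past (X[0], ..., X[n-1]) to the average of X[n] over its continuations."""
    groups = {}
    for w in product((-1, 1), repeat=n + 1):
        x, total = [], 0
        for wk in w:
            total += wk
            x.append(total)                    # running sum X[k] = W[0] + ... + W[k]
        groups.setdefault(tuple(x[:-1]), []).append(x[-1])
    return {past: sum(vals) / len(vals) for past, vals in groups.items()}


def is_martingale(n):
    """True if E{X[n] | past} equals X[n-1] for every possible past."""
    return all(mean == past[-1] for past, mean in conditional_means(n).items())


checks = {n: is_martingale(n) for n in range(1, 8)}
```

Every past has exactly two equally likely continuations, X[n-1] + 1 and X[n-1] - 1, so the conditional mean equals the most recent value.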


The next theorem shows the connection between the Strong Laws, which have to do 
with the convergence of sample sequences, and Martingales. It provides a kind of Chebyshev 
inequality for the maximum term in an n-point Martingale sequence. 


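Theorem 8.8-3 Let $X[n]$ be a Martingale sequence defined on $n \ge 0$. Then for every
$\varepsilon > 0$ and for any positive $n$,

$$P\left[ \max_{0 \le k \le n} |X[k]| \ge \varepsilon \right] \le E\{X^2[n]\}/\varepsilon^2.$$


Proof For $0 \le j \le n$, define the mutually exclusive events,

$$A_j \triangleq \{|X[k]| \ge \varepsilon \text{ for the first time at } j\}.$$

Then the event $\{\max_{0 \le k \le n} |X[k]| \ge \varepsilon\}$ is just a union of these events. Also define the
random variables,

$$I_j \triangleq \begin{cases} 1, & \text{if } A_j \text{ occurs}, \\ 0, & \text{otherwise}, \end{cases}$$

called the indicators of the events $A_j$. Then

$$E\{X^2[n]\} \ge \sum_{j=0}^{n} E\{X^2[n] I_j\} \tag{8.8-1}$$

since $\sum_{j=0}^{n} I_j \le 1$. Also $X^2[n] = (X[j] + (X[n] - X[j]))^2$, so expanding and inserting into
Equation 8.8-1 we get

$$\begin{aligned}
E\{X^2[n]\} &\ge \sum_{j=0}^{n} E\{X^2[j] I_j\} + 2\sum_{j=0}^{n} E\{X[j](X[n] - X[j]) I_j\} + \sum_{j=0}^{n} E\{(X[n] - X[j])^2 I_j\} \\
&\ge \sum_{j=0}^{n} E\{X^2[j] I_j\} + 2\sum_{j=0}^{n} E\{X[j](X[n] - X[j]) I_j\}.
\end{aligned} \tag{8.8-2}$$

Letting $Z_j \triangleq X[j] I_j$, we can write each summand of the second term in Equation 8.8-2 as
$E\{Z_j (X[n] - X[j])\}$, and noting that $Z_j$ depends only on $X[0], \ldots, X[j]$, we then have

$$\begin{aligned}
E\{Z_j (X[n] - X[j])\} &= E\{E[Z_j (X[n] - X[j]) \mid X[0], \ldots, X[j]]\} \\
&= E\{Z_j E[X[n] - X[j] \mid X[0], \ldots, X[j]]\} \\
&= E\{Z_j (X[j] - X[j])\} \\
&= 0.
\end{aligned}$$

Thus Equation 8.8-2 becomes

$$E\{X^2[n]\} \ge \sum_{j=0}^{n} E\{X^2[j] I_j\} \ge \varepsilon^2 \sum_{j=0}^{n} E\{I_j\} = \varepsilon^2 P\left[ \max_{0 \le k \le n} |X[k]| \ge \varepsilon \right].$$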
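
The inequality of Theorem 8.8-3 can be checked exactly for the ±1 counting Martingale X[n] = W[0] + ... + W[n], for which E{X²[n]} = n + 1. The sketch below enumerates all sign patterns for an assumed horizon n = 9 and compares the probability of the maximum exceeding each integer threshold with the bound; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def max_exceed_probability(n, eps):
    """Exact P[max_{0<=k<=n} |X[k]| >= eps] for the +/-1 counting sequence."""
    count = 0
    for w in product((-1, 1), repeat=n + 1):
        total, peak = 0, 0
        for wk in w:
            total += wk
            peak = max(peak, abs(total))
        if peak >= eps:
            count += 1
    return Fraction(count, 2 ** (n + 1))


def second_moment(n):
    """E{X^2[n]} = n + 1: a sum of n + 1 independent zero-mean, unit-variance terms."""
    return Fraction(n + 1)


n = 9
table = [
    (eps, max_exceed_probability(n, eps), second_moment(n) / eps ** 2)
    for eps in range(1, 8)
]

if __name__ == "__main__":
    for eps, prob, bound in table:
        print(eps, float(prob), float(bound))
```

The bound is trivial for small thresholds (it exceeds 1) and becomes informative as the threshold grows, in the same way as the ordinary Chebyshev inequality.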


Theorem 8.8-4 (Martingale Convergence theorem) Let $X[n]$ be a Martingale
sequence on $n \ge 0$, satisfying

$$E\{X^2[n]\} \le C < \infty \quad \text{for all } n \text{ for some } C.$$

Then

$$X[n] \to X \ (\text{a.s.}) \quad \text{as } n \to \infty,$$

where $X$ is the limiting random variable.

Proof Let $m \ge 0$ and define $Y[n] \triangleq X[n+m] - X[m]$ for $n \ge 0$. Then $Y[n]$ is a
Martingale, so by Theorem 8.8-3

$$P\left[ \max_{0 \le k \le n} |X[m+k] - X[m]| \ge \varepsilon \right] \le \frac{1}{\varepsilon^2} E\{Y^2[n]\},$$

where

$$\begin{aligned}
E\{Y^2[n]\} &= E\{(X[n+m] - X[m])^2\} \\
&= E\{X^2[n+m]\} - 2E\{X[n+m]X[m]\} + E\{X^2[m]\}.
\end{aligned}$$

Rewriting the middle term, we have

$$\begin{aligned}
E\{X[m]X[n+m]\} &= E\{X[m] E[X[n+m] \mid X[m], \ldots, X[0]]\} \\
&= E\{X[m]X[m]\} \\
&= E\{X^2[m]\} \quad \text{since } X \text{ is a Martingale},
\end{aligned}$$

so

$$E\{Y^2[n]\} = E\{X^2[n+m]\} - E\{X^2[m]\} \ge 0 \quad \text{for all } m, n \ge 0. \tag{8.8-3}$$

Therefore $E\{X^2[n]\}$ must be monotonic nondecreasing. Since it is bounded from above
by $C < \infty$, it must converge to a limit. Since it has a limit, then by Equation 8.8-3, the
$E\{Y^2[n]\} \to 0$ as $m$ and $n \to \infty$. Thus,

$$\lim_{m \to \infty} P\left[ \max_{k \ge 0} |X[m+k] - X[m]| > \varepsilon \right] = 0,$$

which implies $P[\lim_{m \to \infty} \max_{k \ge 0} |X[m+k] - X[m]| > \varepsilon] = 0$ by the continuity of the
probability measure $P$ (cf. Corollary to Theorem 8.1-1). Finally by the Cauchy convergence
criterion, there exists a random variable $X$ such that

$$X[n] \to X \ (\text{a.s.}).$$

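Theorem 8.8-5 (Strong Law of Large Numbers) Let $X[n]$ be a WSS independent
random sequence with mean $\mu_X$ and variance $\sigma_X^2$ defined for $n \ge 1$. Then as $n \to \infty$

$$\hat{\mu}_X[n] = \frac{1}{n} \sum_{k=1}^{n} X[k] \to \mu_X \quad (\text{a.s.}).$$


Proof Let $Y[n] \triangleq \sum_{k=1}^{n} \frac{1}{k} X_c[k]$, where $X_c[k] \triangleq X[k] - \mu_X$ is the centered sequence and $Y[0] \triangleq 0$; then $Y[n]$ is a Martingale on $n \ge 1$. Since

$$E\{Y^2[n]\} = \sum_{k=1}^{n} \frac{\sigma_X^2}{k^2} \le \sigma_X^2 \sum_{k=1}^{\infty} \frac{1}{k^2} = C < \infty,$$

we can apply Theorem 8.8-4 to show that $Y[n] \to Y$ (a.s.) for some random variable $Y$.
Next noting that $X_c[k] = k(Y[k] - Y[k-1])$, we can write

$$\begin{aligned}
\frac{1}{n} \sum_{k=1}^{n} X_c[k] &= \frac{1}{n} \sum_{k=1}^{n} kY[k] - \frac{1}{n} \sum_{k=1}^{n} kY[k-1] \\
&= -\frac{1}{n} \sum_{k=1}^{n} Y[k] + \frac{n+1}{n} Y[n] \\
&\to -Y + Y = 0 \quad (\text{a.s.})
\end{aligned}$$

so that

$$\hat{\mu}_X[n] \to \mu_X \quad (\text{a.s.}).$$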
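
The summation-by-parts step used in this proof can be checked on a simulated sample path. The sketch below assumes centered samples drawn uniformly on [-1, 1] (an arbitrary choice) and verifies that (1/n)ΣX_c[k] equals -(1/n)ΣY[k] + ((n+1)/n)Y[n] for every n along the path; the function name is illustrative.

```python
import random


def strong_law_check(xc):
    """xc[k-1] holds the centered sample X_c[k] for k = 1, ..., n.

    Returns the largest gap between the two sides of the summation-by-parts
    identity, the final sample mean, and the final value of Y[n].
    """
    y, running = [], 0.0
    for k, value in enumerate(xc, start=1):
        running += value / k
        y.append(running)                      # Y[k] = sum_{j<=k} X_c[j] / j
    worst, sum_x, sum_y = 0.0, 0.0, 0.0
    for n in range(1, len(xc) + 1):
        sum_x += xc[n - 1]
        sum_y += y[n - 1]
        lhs = sum_x / n
        rhs = -sum_y / n + (n + 1) / n * y[n - 1]
        worst = max(worst, abs(lhs - rhs))
    return worst, sum_x / len(xc), y[-1]


rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(20_000)]
worst_gap, final_mean, final_y = strong_law_check(samples)
```

On a single path the weighted sum Y[n] settles to a finite value while the running mean of the centered samples drifts toward zero, which is the sample-path statement of the Strong Law.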


SUMMARY 


In this chapter we introduced the concept of a random sequence and studied its properties
and ways to characterize it. We defined the random sequence as a family of sample sequences
each associated with an outcome or point in the sample space. We introduced several important
random sequences. Then we reviewed linear discrete-time theory and considered the
practical problem of finding out how sample sequences are modified as they pass through
the system. Our emphasis was on how the mean and covariance function are transformed
by a linear system. We then considered the special but important case of stationary and
WSS random sequences and introduced the concept of power spectral density for them.
We looked at convergence of random sequences and learned to appreciate the variety of
modes of convergence that are possible. We then applied some of these results to the laws
of large numbers and used Martingale properties to prove the important strong law of large
numbers.

Some additional sources for the material in this chapter are [8-9], [8-10], and [8-11].

In the next chapter we will discover that many of these results extend to the case of 
continuous time as we continue our study with random processes. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 
8.1 Prove the chain rule for the probability of the intersection of $N$ events, $\{A_n\}_{n=1}^{N}$.
For example, for $N = 3$ we have,
$$P[A_1 A_2 A_3] = P[A_1]P[A_2|A_1]P[A_3|A_1 A_2].$$
Interpret this result for joint CDFs and joint pdf's.
*8.2 Consider an $N$-dimensional random vector $\mathbf{X}$. Show that pairwise independence of
its random variable components does not imply that the components are jointly
independent.

8.3 Let $\mathbf{X} = (X_1, X_2, \ldots, X_5)^T$ be a random vector whose components satisfy the
equations

$$X_i = X_{i-1} + B_i, \quad 1 < i \le 5,$$

where the $B_i$ are jointly independent and Bernoulli distributed, taking on values 0
and 1, with mean value 1/2. The first value is $X_1 = B_1$. Put the $B_i$ together to
make a random vector $\mathbf{B}$.

(a) Write $\mathbf{X} = \mathbf{A}\mathbf{B}$ for some constant matrix $\mathbf{A}$ and determine $\mathbf{A}$.
(b) Find the mean vector $\boldsymbol{\mu}_X$.
(c) Find the covariance matrix $\mathbf{K}_{BB}$.
(d) Find the covariance matrix $\mathbf{K}_{XX}$.

[For parts (b) through (d), express your answers in terms of the matrix $\mathbf{A}$.]

8.4 Let a collection of sequences $x(n, \theta_k)$ be given in terms of a deterministic parameter
$\theta_k$ as

$$\left\{ \cos\left( \frac{2\pi n}{5} + \theta_k \right) \right\}_{k=0}^{N-1}.$$

Now define a random variable $\Theta$ taking on values from the same parameter set $\{\theta_k\}$.
Let the PMF of $\Theta$ be given as

$$P_{\Theta}(\theta_k) = \frac{1}{N} \quad \text{for } k = 0, \ldots, N-1.$$

Now set $X[n] = \cos\left( \frac{2\pi n}{5} + \Theta \right)$.

(a) Is $X[n]$ a random sequence? If so, describe both the mapping $X(n, \zeta)$ and
its probability space $(\Omega, \mathcal{F}, P)$. If not, explain fully.
(b) Let $\theta_k = \frac{2\pi k}{N}$ for $k = 0, \ldots, N-1$, and find $E\{X[n]\}$.†
(c) For the same $\theta_k$ as in part (b), find $E\{X[n]X[m]\}$. Take $N > 2$ here.

†Note: $\cos(A + B) = \cos A \cos B - \sin A \sin B$ and $\cos A \cos B = \frac{1}{2}\{\cos(A + B) + \cos(A - B)\}$.

*8.5 Often one is given a problem statement starting as follows: "Let $X$ be a real-valued
random variable with pdf $f_X(x)$...." Since an RV is a mapping from a sample space
$\Omega$ with field of events $\mathcal{F}$ and a probability measure $P$, evidently the existence of
an underlying probability space $(\Omega, \mathcal{F}, P)$ is assumed by such a problem statement.
Show that a suitable underlying probability space $(\Omega, \mathcal{F}, P)$ can always be created,
thus legitimizing problem statements such as the one above.

8.6 Let $T$ be a continuous random variable denoting the time at which the first photon
is emitted from a light source; $T$ is measured from the instant the source is energized.
Assume that the probability density function for $T$ is $f_T(t) = \lambda e^{-\lambda t} u(t)$ with
$\lambda > 0$.

(a) What is the probability that at least one photon is emitted prior to time $t_2$
if it is known that none was emitted prior to time $t_1$, where $t_1 < t_2$?

(b) What is the probability that at least one photon is emitted prior to time $t_2$
if three independent sources of this type are energized simultaneously?

8.7 Let $X$ be a conditionally Normal random variable, with conditional density function
$N(\mu, \sigma^2)$, given the values of $M = \mu$ and $\Sigma^2 = \sigma^2$.

(a) Assume $\sigma^2$ is a known constant but that $M$ is a random variable having the
CDF

$$F_M(m) = [1 - e^{-\lambda m}]\, u(m)$$

(note that variable $m$ is continuous here!), where $\lambda$ is a known positive value.
Determine the characteristic function for $X$. (Hint: First define a conditional
characteristic function.)

(b) Now assume both $\Sigma^2$ and $M$ are independent random variables. Let their
distributions be arbitrary, but assume both have a finite mean and variance.
Determine the mean and variance for $X$ in terms of $\mu_M$, $\mu_{\Sigma^2}$, and $\sigma_M^2$.


8.8 Let $X$ and $Y$ be i.i.d. random variables with the exponential probability density
functions

$$f_X(w) = f_Y(w) = \lambda e^{-\lambda w} u(w).$$

(a) Determine the probability density function for the ratio
$0 < R \triangleq \dfrac{X}{X+Y} \le 1$, that is, $f_R(r)$, $0 < r \le 1$.
(b) Let $A$ be the event $X < 1/Y$. Determine the conditional pdf of $X$ given that
$A$ occurs and that $Y = y$; that is, determine

$$f_X(x \mid A, Y = y).$$

(c) Using the definitions of (b), what is the minimum mean-square error estimate
of $X$ given that the event $A$ occurs and that $Y = y$?


8.9 Use the Schwarz inequality for complex random variables to prove that

$$|R_X[m]| \le R_X[0] \quad \text{for all integers } m$$

for any complex-valued WSS random sequence $X[m]$.
8.10 Let $\mathbf{X} = (X_1, X_2, \ldots, X_{10})^T$ be a random vector whose components satisfy the
equations

$$X_i = \tfrac{2}{5}(X_{i-1} + X_{i+1}) + W_i \quad \text{for } 2 \le i \le 9,$$

where the $W_i$ are independent and Laplacian distributed with mean zero and variance
$\sigma^2$ for $i = 1$ to 10, and $X_1 = \tfrac{1}{2}X_2 + \tfrac{5}{4}W_1$ and $X_{10} = \tfrac{1}{2}X_9 + \tfrac{5}{4}W_{10}$.

(a) Find the mean vector $\boldsymbol{\mu}_X$.
(b) Find the covariance matrix $\mathbf{K}_{XX}$.
(c) Write an expression for the multidimensional pdf of the random vector $\mathbf{X}$.


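[Hint: Matrix identity: if

$$\mathbf{A} = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\ \vdots & & \ddots & \ddots & \vdots \\ \rho^{N-1} & \cdots & \rho^2 & \rho & 1 \end{bmatrix},$$

then $\mathbf{A}^{-1}$ is given as

$$\beta^2 \mathbf{A}^{-1} = \begin{bmatrix} 1-\rho\alpha & -\alpha & 0 & \cdots & 0 \\ -\alpha & 1 & -\alpha & \ddots & \vdots \\ 0 & -\alpha & \ddots & \ddots & 0 \\ \vdots & \ddots & \ddots & 1 & -\alpha \\ 0 & \cdots & 0 & -\alpha & 1-\rho\alpha \end{bmatrix},$$

with $\alpha \triangleq \rho/(1+\rho^2)$ and $\beta^2 \triangleq (1-\rho^2)/(1+\rho^2)$. The Laplacian pdf is given as

$$f_W(w) = \frac{1}{\sqrt{2}\,\sigma} \exp\!\left( -\frac{\sqrt{2}\,|w|}{\sigma} \right), \quad -\infty < w < +\infty.]$$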


8.11 Prove Corollary 8.1-1. 
8.12 Let $\{X_i\}$ be a sequence of i.i.d. Normal random variables with zero mean and unit
variance. Let

$$S_k \triangleq X_1 + X_2 + \cdots + X_k \quad \text{for } k \ge 1.$$

Determine the joint probability density function for $S_n$ and $S_m$, where $1 \le m < n$.
8.13 In Example 8.1-8 we saw that CDFs are continuous from the right. Are they contin- 
uous from the left also? Either prove or give a counterexample. 
8.14 Let the probability space $(\Omega, \mathcal{F}, P)$ be given as follows:

$\Omega = \{a, b, c\}$, that is, the outcome $\zeta = a$ or $b$ or $c$,
$\mathcal{F}$ = all subsets of $\Omega$,
$P[\{\zeta\}] = 1/3$ for each outcome $\zeta$.


Let the random sequence $X[n]$ be defined as follows:

(a) Find the mean function $\mu_X[n]$.
(b) Find the correlation function $R_{XX}[m, n]$.
(c) Are $X[1]$ and $X[0]$ independent? Why?

8.15 Let the stationary Gaussian random sequence $Y[n]$ have mean zero and covariance
function

$$K_{YY}[m] = \sigma^2 \rho^{|m|} \quad \text{for } -\infty < m < +\infty, \text{ where } |\rho| < 1.$$

(a) Solve for the conditional mean of $X[n]$ given $X[n-1]$.
(b) In what sense is this a good predictor of $X[n]$?

8.16 Consider a random sequence $X[n]$ as the input to a linear filter with impulse response

$$h[n] = \begin{cases} 1/2, & n = 0 \\ 1/2, & n = 1 \\ 0, & \text{else.} \end{cases}$$

We denote the output random sequence $Y[n]$, that is, for each outcome $\zeta$,

$$Y[n, \zeta] = \sum_{k=-\infty}^{+\infty} h[k]\, X[n-k, \zeta].$$

Assume the filter runs for all time, $-\infty < n < +\infty$. We are given the mean function
of the input $\mu_X[n]$ and correlation function of the input $R_{XX}[n_1, n_2]$. Express your
answers in terms of these assumed known functions.

(a) Find the mean function of the output $\mu_Y[n]$.
(b) Find the autocorrelation function of the output $R_{YY}[n_1, n_2]$.
(c) Write the autocovariance function of the output $K_{YY}[n_1, n_2]$ in terms of your
answers to parts (a) and (b).
(d) Now assume that the input $X[n]$ is a Gaussian random sequence, and write
the corresponding joint pdf of the output $f_Y(y_1, y_2; n_1, n_2)$ at two arbitrary
times $n_1 \ne n_2$ in terms of $\mu_Y[n]$ and $K_{YY}[n_1, n_2]$.

8.17 The random arrival time sequence $T[n]$, defined for $n \ge 1$, was found to have the
Erlang type pdf, for some $\lambda > 0$:

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!}\, \lambda \exp(-\lambda t)\, u(t).$$

Find the joint pdf $f_T(t_2, t_1; 10, 5)$. Recall that $T[n]$ is the sum up to time $n$ of the
i.i.d. interarrival time sequence with exponential pdf's.

8.18 Let $T[n]$ denote the random arrival sequence studied in class,

$$T[n] = \sum_{k=1}^{n} \tau[k],$$

where the $\tau[k]$ are an independent random sequence of interarrival times, distributed
as exponential with parameter $\lambda > 0$.

(a) Find the CF of this random sequence, that is, $\Phi_T(\omega; n) = E[e^{j\omega T[n]}]$.
(b) Use this CF to find the mean function $\mu_T[n]$.

8.19 Let the random sequence $T[n]$ be defined on $n \ge 1$ and for each $n$, have an Erlang
pdf:

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!}\, \lambda e^{-\lambda t}\, u(t), \quad \lambda > 0.$$

Define the new random sequence $\tau[n] \triangleq T[n] - T[n-1]$ for $n \ge 2$, and set $\tau[1] \triangleq T[1]$.
Can we conclude that $\tau[n]$ is exponential with the same parameter $\lambda$? If not, what
additional information on the random sequence $T[n]$ is needed? Justify your answer.

8.20 This problem considers a random sequence model for a charge coupled device (CCD)
array with very “leaky” cells. We start by defining the width-3 pulse function:

$$h[n] \triangleq \begin{cases} 1/4, & n = -1 \\ 1/2, & n = 0 \\ 1/4, & n = +1 \\ 0, & \text{else,} \end{cases}$$

and as illustrated in Figure P8.20, which we will use to account for 25 percent of
the charge in a cell that leaks out to its right neighbor and 25 percent that leaks to
its left neighbor. We assume that the one-dimensional CCD array is infinitely long
and represent the array contents by the random sequence $X$:


Figure P8.20 Pulse function of leaky cell.

where $\zeta_i$ is the $i$th component of $\zeta$, the infinite dimensional outcome of the experiment.
The random variables $A(\zeta_i)$ are jointly independent and Gaussian distributed
with mean 4 and variance 4.

(a) Find the mean function $\mu_X[n]$.
(b) Find the first-order pdf $f_X(x; n)$.
(c) Find the joint pdf $f_X(x_1, x_2; n, n+1)$.


8.21 We are given a random sequence $X[n]$ for $n \ge 0$ with conditional pdf's

$$f_X(x_n \mid x_{n-1}) = \alpha \exp[-\alpha(x_n - x_{n-1})]\, u(x_n - x_{n-1}) \quad \text{for } n \ge 1,$$

with $u(x)$ the unit-step function and initial pdf $f_X(x_0) = \delta(x_0)$. Take $\alpha > 0$.

(a) Find the first-order pdf $f_X(x_n)$ for $n = 2$.
(b) Find the first-order pdf $f_X(x_n)$ for arbitrary $n > 1$ using mathematical induction.


8.22 Let $x[n]$ be a deterministic input to the LSI discrete-time system $H$ shown in
Figure P8.23.

(a) Use linearity and shift-invariance properties to show that

$$y[n] = x[n] * h[n] \triangleq \sum_{k=-\infty}^{+\infty} x[k]\, h[n-k] = h[n] * x[n].$$

(b) Define the Fourier transform of a sequence $a[n]$ as

$$A(\omega) \triangleq \sum_{n=-\infty}^{+\infty} a[n]\, e^{-j\omega n}, \quad -\pi \le \omega \le +\pi,$$

and show that the inverse Fourier transform is

$$a[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} A(\omega)\, e^{+j\omega n}\, d\omega, \quad -\infty < n < +\infty.$$

(c) Using the results in (a) and (b), show that

$$Y(\omega) = H(\omega) X(\omega), \quad -\pi \le \omega \le +\pi,$$

for an LSI discrete-time system.
8.23 Consider the difference equation

$$y[n] + \alpha y[n-1] = x[n], \quad -\infty < n < +\infty,$$

where $-1 < \alpha < +1$.

(a) Let the input be $x[n] = \beta^n u[n]$ for $-1 < \beta < +1$. Find the solution for $y[n]$
assuming causality applies, that is, $y[n] = 0$ for $n < 0$.

Figure P8.23 LSI system with impulse response $h[n]$ (input $x[n]$, output $y[n]$).

(b) Let the input be $x[n] = \beta^{-n} u[-n]$ for $-1 < \beta < +1$. Find the solution for
$y[n]$ assuming anticausality applies,† that is, $y[n] = 0$ for $n > 0$.


8.24 Let $X[n]$ be a WSS random sequence with mean zero and covariance function

$$K_{XX}[m] = \sigma^2 \rho^{|m|} \quad \text{for all } -\infty < m < +\infty,$$

where $\rho$ is a real constant. Consider difference equations of the form

$$Y[n] = X[n] - \alpha X[n-1] \quad \text{with } -\infty < n < +\infty.$$

(a) Write the covariance function of $Y[n]$ in terms of the parameters $\sigma^2$, $\rho$, and $\alpha$.
(b) Find a value of $\alpha$ such that $Y[n]$ is a WSS white noise sequence.
(c) What is the average power of this white noise?


8.25 Let $W[n]$ be an independent random sequence with mean 0 and variance $\sigma_W^2$ defined
for $-\infty < n < +\infty$. For appropriately chosen $\rho$, let the stationary random sequence
$X[n]$ satisfy the causal LCCDE

$$X[n] = \rho X[n-1] + W[n], \quad -\infty < n < +\infty.$$

(a) Show that $X[n-1]$ and $W[n]$ are independent at time $n$.
(b) Derive the characteristic function equation

$$\Phi_X(\omega) = \Phi_X(\rho\omega)\, \Phi_W(\omega).$$

(c) Find the continuous solution to this functional equation for the unknown
function $\Phi_X$ when $W[n]$ is assumed to be Gaussian. [Note: $\Phi_X(0) = 1$.]
(d) What is $\sigma_X^2$?


†This part requires more detailed knowledge of the z-transform (cf. Appendix A).

8.26 Consider the LSI system shown in Figure P8.26, whose deterministic input $x[n]$
is contaminated by noise (a random sequence) $W[n]$. We wish to determine the
properties of the output random sequence $Y[n]$. The noise $W[n]$ has mean $\mu_W[n] = 2$
and autocorrelation $E\{W[m]W[n]\} = \sigma_W^2\, \delta[m-n] + 4$. The impulse response is
$h[n] = \rho^n u[n]$ with $|\rho| < 1$. The deterministic input $x[n]$ is given as $x[n] = 3$ for
all $n$.

(a) Find the output mean $\mu_Y[n]$.
(b) Find the output power $E\{Y^2[n]\}$.
(c) Find the output covariance $K_{YY}[m, n]$.

Figure P8.26 LSI system with deterministic-plus-noise input (labels: $W[n]$, $x[n]$, $h[n]$, $Y[n]$).

8.27 Show that the random sequence $X[n]$ generated in Example 8.1-15 is not an independent
random sequence.

8.28 Let $W[n]$ be an independent random sequence with mean $\mu_W = 0$ and variance $\sigma_W^2$.
Define a new random sequence $X[n]$ as follows:

$$X[0] = 0, \qquad X[n] = \rho X[n-1] + W[n] \quad \text{for } n \ge 1.$$

(a) Find the mean value of $X[n]$ for $n \ge 0$.
(b) Find the covariance of $X[n]$, $K_{XX}[m, n]$.
(c) For what values of $\rho$ does $K_{XX}[m, n]$ tend to $G[m-n]$ (for some finite-valued
function $G$) as $m$ and $n$ become large? This situation is called asymptotic
stationarity.

8.29 Let the random variables $A$ and $B$ be i.i.d. with mean 0, variance $\sigma^2$, and third-order
moment $E[A^3] = E[B^3] = m_3 \ne 0$. Consider the random sequence
$X[n] = A\cos\omega_0 n + B\sin\omega_0 n$, $-\infty < n < +\infty$, where $\omega_0$ is a fixed radian frequency.

(a) Show that $X[n]$ is WSS.
(b) Prove that $X[n]$ is not stationary by presenting a counterexample.

8.30 Consider a WSS random sequence $X[n]$ with mean $\mu_X[n] = \mu$, a constant, and
correlation function $R_{XX}[m] = p^2\delta[m]$ with $p^2 > 0$. In such a case $\mu$ must be
zero, as you will show in this problem. Note that the covariance function here is
$K_{XX}[m] = p^2\delta[m] - \mu^2$.

(a) Take $m = 0$ and conclude that $p^2 \ge \mu^2$.
(b) Take a vector $\mathbf{X}$ of length $N$ out of the random sequence $X[n]$. Show that
the corresponding covariance matrix $\mathbf{K}_{XX}$ will be positive semidefinite only
if $\mu^2 \le \sigma^2/(N-1)$, where $\sigma^2 \triangleq p^2 - \mu^2$. (Hint: Take coefficient vector $\mathbf{a} = \mathbf{1}$,
i.e., all 1's.)
(c) Let $N \to \infty$ and conclude that $\mu$ must be zero for the stationary white noise
sequence $X[n]$.
8.31 For the linear filter of Problem 8.16, assume the input random sequence is WSS, and
write the psd of the output sequence $S_{YY}(\omega)$ in terms of the psd of the input $S_{XX}(\omega)$.

(a) Show that the psd is a real-valued function, even if $X[n]$ is a complex-valued
random sequence.

(b) Show that if $X[n]$ is real valued, then $S_{XX}(\omega) = S_{XX}(-\omega)$.

(c) Show that $S_{XX}(\omega) \ge 0$ for every $\omega$ whether $X[n]$ is complex-valued or not.


8.32 Let the WSS random sequence $X$ have correlation function

$$R_{XX}[m] = 10e^{-\lambda_1|m|} + 5e^{-\lambda_2|m|}$$

with $\lambda_1 > 0$ and $\lambda_2 > 0$. Find the corresponding psd $S_{XX}(\omega)$ for $|\omega| \le \pi$.

8.33 The psd of a certain random sequence is given as $S_{XX}(\omega) = 1/[(1+\alpha^2) - 2\alpha\cos\omega]^2$
for $-\pi \le \omega \le +\pi$, where $|\alpha| < 1$. Find the random sequence's correlation function
$R_{XX}[m]$.

8.34 Let the input to system $H(\omega)$ be $W[n]$, a white noise random sequence with $\mu_W[n] = 0$
and $K_{WW}[m] = \delta[m]$. Let $X[n]$ denote the corresponding output random sequence.
Find $K_{XW}[m]$ and $S_{XW}(\omega)$.

8.35 Consider the system shown in Figure P8.35. Let $X[n]$ and $V[n]$ be WSS and mutually
uncorrelated with zero mean and psd's $S_{XX}(\omega)$ and $S_{VV}(\omega)$, respectively.

Figure P8.35 LSI system with random signal-plus-noise input (labels: $V[n]$, $X[n]$, $h[n]$, $Y[n]$).

(a) Find the psd of the output $S_{YY}(\omega)$.
(b) Find the cross-power spectral density between the input $X$ and the output $Y$,
that is, find $S_{XY}(\omega)$.


8.36 Consider the discrete-time system with input random sequence $X[n]$ and output
$Y[n]$ given as

$$Y[n] = \frac{1}{5} \sum_{k=-2}^{+2} X[n-k].$$

Assume that the input sequence $X[n]$ is WSS with psd $S_{XX}(\omega) = 2$.

(a) Find the psd of the output random sequence $S_{YY}(\omega)$.
(b) Find the output correlation function $R_{YY}[m]$.

8.37 Let the stationary random sequence $Y[n] = X[n] + U[n]$ with power spectral density
(psd) $S_Y(\omega)$ be our model of signal $X$ plus noise $U$ for a certain discrete-time
channel. Assume that $X$ and $U$ are orthogonal and also assume that we have
$S_Y(\omega) > 0$ for all $|\omega| \le \pi$. As a first step in processing $Y$ to find an estimate for
$X$, let $Y$ be input to a discrete-time filter $G(\omega)$ defined as $G(\omega) = 1/\sqrt{S_Y(\omega)}$ to
produce the stationary output sequence $W[n]$ as shown in Figure P8.37a.

Figure P8.37a (labels: $U[n]$, $Y[n]$, $G(\omega)$).

(a) Find the psd of $W[n]$, that is, $S_W(\omega)$, and also the cross-power spectral density
between original input and output $S_{XW}(\omega)$, in terms of $S_X$, $S_U$, and $S_Y$.

(b) Next filter $W[n]$ with an FIR impulse response $h[n]$, $n = 0, \ldots, N-1$, to give
output $\hat{X}[n]$, an estimate of the original noise-free signal $X[n]$ as shown in
Figure P8.37b. In line with the Hilbert space theory of random variables, we
decide to choose the filter coefficients $h[n]$ so that the estimate error $\hat{X}[n] - X[n]$
will be orthogonal to all those $W[n]$ actually used in making the estimate
at time $n$. Write down the resulting equations for the $N$ filter coefficients
$h[0], h[1], \ldots, h[N-1]$. Your answer should be in terms of the cross-correlation
function $R_{XW}[m]$.

Figure P8.37b (labels: $W[n]$, $h[n]$, $\hat{X}[n]$).

(c) Let $N$ go to infinity, and write the frequency response of $h[n]$, that is, $H(\omega)$,
in terms of the discrete-time power spectral densities $S_{XX}(\omega)$ and $S_{YY}(\omega)$.
8.38 Higher than second-order moments have proved useful in certain advanced applications.
Here we consider a third-order correlation function of a stationary random
sequence

$$R_X[m_1, m_2] \triangleq E\{X[n+m_1]\, X[n+m_2]\, X^*[n]\}$$

defined for the random sequence $X[n]$, $-\infty < n < +\infty$.

(a) Let $Y[n]$ be the output from an LSI system with impulse response $h[n]$, due to the
input random sequence $X[n]$. Determine a convolution-like equation expressing
the third-order correlation function of the output $R_Y[m_1, m_2]$ in terms of the
third-order correlation function of the input $R_X[m_1, m_2]$ and the system impulse
response $h[n]$.

(b) Define the bi-spectral density of $X$ as the two-dimensional Fourier transform

$$S_X(\omega_1, \omega_2) = \sum_{m_1} \sum_{m_2} R_X[m_1, m_2] \exp[-j(\omega_1 m_1 + \omega_2 m_2)].$$

For the system of part (a), find an expression for the bi-spectral density of the
output $S_Y(\omega_1, \omega_2)$ in terms of the system frequency response $H(\cdot)$ and the
bi-spectral density of the input $S_X(\omega_1, \omega_2)$.

8.39 Let $X[n]$ be a Markov chain on $n \ge 0$ taking values 1 and 2 with one-step transition
probabilities,

$$P_{ij} \triangleq P\{X[n] = j \mid X[n-1] = i\}, \quad 1 \le i, j \le 2,$$

given in matrix form as
0.9 0.1
in be | = (pi,3)-
We describe the state probabilities at time $n$ by the vector

$$\mathbf{p}[n] = [P\{X[n] = 1\},\; P\{X[n] = 2\}].$$

(a) Show that $\mathbf{p}[n] = \mathbf{p}[0]\mathbf{P}^n$.

(b) Draw a two-state transition diagram and label the branches with the one-step
transition probabilities $p_{ij}$. Don't forget the $p_{ii}$ or self-transitions. (See
Figure 8.5-1 for state-transition diagram of a Markov chain.)

(c) Given that $X[0] = 1$, find the probability that the first transition to state 2
occurs at time $n$.

8.40 Consider using a first-order Markov sequence to model a random sequence $X[n]$ as

$$X[n] = rX[n-1] + Z[n],$$

where $Z[n]$ is white noise of variance $\sigma_Z^2$. Thus, we can look at $X[n]$ as the output
of passing $Z[n]$ through a linear system. Take $|r| < 1$ and assume the system has
been running for a long time, that is, $-\infty < n < +\infty$.

(a) Find the psd of $X[n]$, that is, $S_{XX}(\omega)$.
(b) Find the correlation function $R_{XX}[m]$.

8.41 We defined a Markov random sequence $X[n]$ in this chapter as being specified by its
first-order pdf $f_X(x; n)$ and its one-step conditional pdf

$$f_X(x_n \mid x_{n-1}; n, n-1) = f_X(x_n \mid x_{n-1}) \text{ for short.}$$

(a) Find the two-step pdf for a Markov random sequence $f_X(x_n \mid x_{n-2})$ in terms
of the above functions. Here, take $n \ge 2$ for a random sequence starting
at $n = 0$.
(b) Find the $N$-step pdf $f_X(x_n \mid x_{n-N})$ for arbitrary positive integer $N$, where we
only need consider $n \ge N$.
8.42 Consider a generalized random walk sequence X[n] running on {n > 0} and defined 
as follows: 


where W[n] is an independent random sequence, stationary, and taking values below 
with the indicated probabilities, 


Wine tena 
We see the difference is that the positive and negative step sizes are not the same:
$s_1 \ne s_2$, $s_1 > 0$ and $s_2 > 0$.
(a) Find the mean function $\mu_X[n] \triangleq E\{X[n]\}$.
(b) Find the autocorrelation function $R_X[n_1, n_2] \triangleq E\{X[n_1]\, X[n_2]\}$.


8.43 Consider a Markov random sequence $X[n]$ running on $1 \le n \le 100$. It is statistically
described by its first-order pdf $f_X(x; 1)$ and its one-step transition (conditional) pdf
$f_X(x_n \mid x_{n-1}; n, n-1)$. By the Markov definition, we have (suppressing the time
variables) that

$$f_X(x_n \mid x_{n-1}) = f_X(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) \quad \text{for } 2 \le n \le 100.$$

Show that a Markov random sequence is also Markov in the reverse order, that is,

$$f_X(x_n \mid x_{n+1}) = f_X(x_n \mid x_{n+1}, x_{n+2}, \ldots, x_{100}) \quad \text{for } 1 \le n \le 99,$$

and so one can alternatively statistically describe a Markov random sequence by the
one-step backward pdf $f_X(x_{n-1} \mid x_n; n-1, n)$ and first-order pdf $f_X(x; 100)$.

8.44 Given a Markov chain $X[n]$ on $n \ge 1$, with transition probabilities given as
$P[x[n] \mid x[n-1]]$, find an expression for the two-step transition probabilities
$P[x[n] \mid x[n-2]]$. Also show that

$$P[x[n+1] \mid x[n-1], x[n-2], \ldots, x[1]] = P[x[n+1] \mid x[n-1]], \quad \text{for } n > 1.$$


8.45 Consider the Markov random sequence $X[n]$ generated by the difference equation,
for $n \ge 1$,

$$X[n] = \alpha X[n-1] + \beta W[n],$$

where the input $W[n]$ is an independent random sequence with zero mean and variance
$\sigma_W^2$, the initial value $X[0] = 0$, and the parameters $\alpha$ and $\beta$ are known constants.

(a) Show that the subsequence $Y[n] \triangleq X[2n]$ is Markov also.
(b) Find the variance function $\sigma_Y^2[n] \triangleq E[|Y[n] - \mu_Y[n]|^2]$ for $n > 0$.

*8.46 Write a MATLAB function called triplemarkov that will compute and plot the
autocorrelation functions for the asymmetric, two-state Markov model in Example 8.1-16
for any three sets of parameters $\{p_{00}, p_{11}\}$. Denote the maximum lag interval as $N$.
Run your routine for {0.2, 0.8}, {0.2, 0.5}, and {0.2, 0.2}. Repeat for {0.8, 0.2},
{0.8, 0.5}, and {0.8, 0.8}. Describe what you observe.

8.47 Consider the probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = [0,1]$, $\mathcal{F}$ defined to be the Borel
sets of $\Omega$, and $P[(0, \zeta]] = \zeta$ for $0 < \zeta \le 1$.

(a) Show that $P[\{0\}] = 0$ by using the axioms of probability.
(b) Determine in what senses the following random sequences converge:
(i) $X[n, \zeta] = e^{-n\zeta}$, $n \ge 0$
(ii) $X[n, \zeta] = \sin(\zeta + \tfrac{1}{n})$, $n \ge 1$
(iii) $X[n, \zeta] = \cos^n(\zeta)$, $n \ge 0$.
(c) If the preceding sequences converge, what are the limits?

8.48 The members of the sequence of jointly independent random variables $X[n]$ have
pdf's of the form


2
fx(aj;n) = (1 =| — oo — (« “= *0) |
1


Determine whether or not the random sequence $X[n]$ converges in
(i) the mean-square sense,
(ii) probability,
(iii) distribution.

8.49 The members of the random sequence $X[n]$ have joint pdf's of the form

$$f_X(\alpha, \beta; m, n) = \frac{mn}{2\pi\sqrt{1-\rho^2}} \exp\!\left( -\frac{1}{2(1-\rho^2)} \left[ m^2\alpha^2 - 2\rho mn\alpha\beta + n^2\beta^2 \right] \right)$$

for $m \ge 1$ and $n \ge 1$ where $-1 < \rho < +1$.

(a) Show that $X[n]$ converges in the mean-square sense as $n \to \infty$ for all
$-1 < \rho < +1$.
(b) Specify the CDF of the mean-square limit $X \triangleq \lim_{n \to \infty} X[n]$.

8.50 State conditions under which the mean-square limit of a sequence of Gaussian
random variables is also Gaussian.

8.51 Let $X[n]$ be a real-valued random sequence on $n \ge 0$, made up from stationary and
independent increments, that is, $X[n] - X[n-1] = W[n]$, “the increment” with
$W[n]$ being a stationary and independent random sequence. The random sequence
always starts with $X[0] = 0$. We also know that at time $n = 1$, $E\{X[1]\} = \eta$ and
$\mathrm{Var}\{X[1]\} = \sigma^2$.

(a) Find $\mu_X[n]$ and $\sigma_X^2[n]$, the mean and variance functions of the random
sequence $X$ at time $n$ for any time $n > 1$.
(b) Prove that $X[n]/n$ converges in probability to $\eta$ as the time $n$ approaches
infinity.
8.52 This problem demonstrates that p-convergence implies convergence in distribution
even when the limiting pdf does not exist.

(a) For any real number $x$ and any positive $\varepsilon$, show that

$$P[X \le x - \varepsilon] \le P[X[n] \le x] + P[|X[n] - X| \ge \varepsilon].$$

(b) Similarly show that

$$P[X > x + \varepsilon] \le P[X[n] > x] + P[|X[n] - X| \ge \varepsilon].$$

For part (c), assume the random sequence $X[n]$ converges to the random
variable $X$ in probability.
(c) Let $n \to \infty$ and conclude that

$$\lim_{n \to \infty} F_X(x; n) = F_X(x)$$

at points of continuity of $F_X$.


8.53 Let $X[n]$ be a second-order random sequence. Let $h[n]$ be the impulse response of
an LSI system. We wish to define the output of the system $Y[n]$ as a mean-square
limit.

(a) Show that we can define the mean-square limit

$$Y[n] \triangleq \sum_{k=-\infty}^{+\infty} h[k]\, X[n-k], \quad -\infty < n < +\infty, \quad \text{(m.s.)}$$

if

$$\sum_k \sum_l h[k]\, h^*[l]\, R_{XX}[n-k, n-l] < \infty \quad \text{for all } n.$$

(Hint: Set $Y_N[n] \triangleq \sum_{k=-N}^{+N} h[k]\, X[n-k]$ and show that the m.s. limit of $Y_N[n]$
exists by using the Cauchy convergence criteria.)

(b) Find a simpler condition for the case when $X[n]$ is a wide-sense stationary
random sequence.

(c) Find the necessary condition on $h[n]$ when $X[n]$ is (stationary) white noise.


8.54 If $X[n]$ is a Martingale sequence on $n \ge 0$, show that

$$E\{X[n+m] \mid X[m], \ldots, X[0]\} = X[m] \quad \text{for all } n \ge 0.$$

8.55 Let $Y[n]$ be a random sequence and $X$ a random variable and consider the conditional
expectation

$$E\{X \mid Y[0], \ldots, Y[n]\} \triangleq G[n].$$

Show that the random sequence $G[n]$ is a Martingale.

*8.56 We can enlarge the concept of Martingale sequence somewhat as follows. Let
$G[n] \triangleq g(X[0], \ldots, X[n])$ for each $n \ge 0$ for measurable functions $g$. We say $G$ is a Martingale
with respect to $X$ if $E\{G[n] \mid X[0], \ldots, X[n-1]\} = G[n-1]$.

(a) Show that Theorem 8.8-3 holds for $G$ a Martingale with respect to $X$. Specifically,
substitute $G$ for $X$ in the statement of the theorem. Then make necessary
changes to the proof.

(b) Show that the Martingale convergence Theorem 8.8-4 holds for $G$ a Martingale
with respect to $X$.

8.57 Consider the hypothesis-testing problem involving $(n+1)$ observations $X[0], \ldots, X[n]$
of the random sequence $X$. Define the likelihood ratio

$$L_X[n] \triangleq \frac{f_X(X[0], \ldots, X[n] \mid H_1)}{f_X(X[0], \ldots, X[n] \mid H_0)}, \quad n \ge 0,$$

corresponding to two hypotheses $H_1$ and $H_0$. Show that $L_X[n]$ is a Martingale with
respect to $X$ under hypothesis $H_0$.

8.58 In the discussion of interpolation in Example 8.4-7, work out the algebra needed to
arrive at the psd of the up-sampled random sequence $X_e[n]$.

8.59 The up-sampled sequence $X_e[n]$ in the interpolation process is clearly not WSS, even
if $X[n]$ is WSS. Create an up-sampled random sequence that is WSS by randomizing
the start-time of the sequence $X[n]$. That is, define a binary random variable $\Theta$
with $P[\Theta = 0] = P[\Theta = 1] = 0.5$. Define the start-time randomized sequence by
$X_r[n] \triangleq X[n + \Theta]$. Then the resulting up-sampled sequence is $X_{er}[n] = X\!\left[\frac{n+\Theta}{2}\right]$.
Show that $R_{X_r X_r}[k] = R_{XX}[k]$ and $R_{X_{er} X_{er}}[m, m+k] = R_{X_{er} X_{er}}[k] = 0.5\, R_{XX}[k/2]$
for $k$ even, and zero for $k$ odd.

9 Random Processes


In the last chapter, we learned how to generalize the concept of random variable to that 
of random sequence. We did this by associating a sample sequence with each outcome 
¢ € Q, thereby generating a family of sequences collectively called a random sequence. 
These sequences were indexed by a discrete (integer) parameter $n$ in some index set $Z$. In this
chapter we generalize further by considering random functions of a continuous parameter.
We take this continuous parameter to be time, but it could equally well be position, or angle,
or some other continuous parameter. The collection of all these continuous-time functions is
called a random process. Random processes will be perhaps the most useful objects we study 
because they can be used to model physical processes directly without any intervening need 
to sample the data. Even when of necessity one is dealing with sampled data, the concept 
of random process will give us the ability to reference the properties of the sample sequence 
to those of the limiting continuous process so as to be able to judge the adequacy of the 
sampling rate. 

Random processes find a wide variety of applications. Perhaps the most common use 
is as a model for noise in physical systems, modeling of the noise being the necessary first 
step in deciding on the best way to mitigate its negative effects. A second class of applica- 
tions concerns the modeling of random phenomena that are not noise but are nevertheless 
unknown to the system designer. An example would be a multimedia signal (audio, image, 
or video) on a communications link. The signal is not noise, but it is unknown from the 
viewpoint of a distant receiver and can take on many (an enormous number of) values. Thus, 
we model such signals as random processes, when some statistical description of the source 
is available. Situations such as this arise in other contexts also, such as control systems,
pattern recognition, etc. Indeed from an information theory viewpoint, any waveform that
communicates information must have at least some degree of randomness in it. 

We start with a definition of random process and study some of the new difficulties to 
be encountered with continuous time. Then we look at the moment functions for random 
processes and generalize the correlation and covariance functions from Chapter 8 to this 
continuous parameter case. We also look at some basic random processes of practical impor- 
tance. We then begin a study of linear systems and random processes. Indeed, this topic is 
central to our study of random processes and is widely used in applications. Then we present 
some classifications of random processes based on general statistical properties. Finally, we 
introduce stationary and wide-sense stationary random processes and their analysis for 
linear systems. 


9.1 BASIC DEFINITIONS 


It is most important to fully understand the basic concept of the random process and its 
associated moment functions. The situation is analogous to the discrete-time case treated 
in Chapter 8. The main new difficulty is that the time axis has now become uncountable. 
We start with the basic definition. 


Definition 9.1-1 Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then define a mapping $X$
from the sample space $\Omega$ to a space of continuous-time functions. The elements in this
space will be called sample functions. This mapping is called a random process if at each
fixed time the mapping is a random variable, that is, $X(t, \zeta) \in \mathcal{F}$† for each fixed $t$ on the
real line $-\infty < t < +\infty$.


Thus we have a multidimensional function $X(t, \zeta)$, which for each fixed outcome $\zeta$
is an ordinary time function and for each fixed $t$ is a random variable. This is shown
diagrammatically in Figure 9.1-1 for the special case where the sample space $\Omega$ is the
continuous interval [0,10]. We see a family of random variables indexed by $t$ when we look
along the time axis, and we see a family of time functions indexed by $\zeta$ when we look along
the outcome “axis.”

We have the following elementary examples of random processes: 


Example 9.1-1
(simple process) $X(t, \zeta) = X(\zeta) f(t)$, where $X$ is a random variable and $f$ is a deterministic
function of the parameter $t$. We also write $X(t) = X f(t)$.


Example 9.1-2
(random sinewave) $X(t, \zeta) = A(\zeta) \sin(\omega_0 t + \Theta(\zeta))$, where $A$ and $\Theta$ are random variables.
We also write $X(t) = A\sin(\omega_0 t + \Theta)$, suppressing the outcome $\zeta$.
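
The two ways of reading $X(t, \zeta)$ can be made concrete by simulating the random sinewave. The following Python sketch is illustrative only: the distribution of $A$ (uniform on [0.5, 1.5]), the uniform phase, the value of `omega0`, and the time grid are all assumptions, since the example leaves them unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed for illustration (not specified in the text):
# A uniform on [0.5, 1.5], Theta uniform on [-pi, pi], omega0 = 2*pi.
omega0 = 2.0 * np.pi
t = np.linspace(0.0, 2.0, 201)      # time grid
num_outcomes = 5                    # number of outcomes zeta that we draw

A = rng.uniform(0.5, 1.5, size=num_outcomes)
Theta = rng.uniform(-np.pi, np.pi, size=num_outcomes)

# Row i holds the sample function X(t, zeta_i);
# column j holds draws of the random variable X(t_j, .).
X = A[:, None] * np.sin(omega0 * t[None, :] + Theta[:, None])

sample_function = X[0, :]     # outcome fixed: an ordinary function of time
random_variable = X[:, 50]    # time fixed: one value per outcome
```

Each row of `X` is one sample function and each column collects realizations of one random variable, which mirrors looking along the time axis versus the outcome "axis."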


More typical examples of random processes can be constructed from random sequences. 


†$X \in \mathcal{F}$ is shorthand for $\{\zeta : X(\zeta) \le x\} \in \mathcal{F}$ for all $x$. This condition permits us to measure the
probability of events of this kind and hence define CDFs.

Figure 9.1-1 A random process for a continuous sample space $\Omega = [0,10]$ (axis label: $X(t, \zeta)$).


Example 9.1-3
$X(t) = \sum_n X[n]\, p_n(t - T[n])$, where $X[n]$ and $T[n]$ are random sequences and the functions
$p_n(t)$ are deterministic waveforms that can take on various shapes. For example, the $p_n(t)$
might be ideal unit-step functions that could provide a model for a so-called jump process.
In this interpretation the $T[n]$ would be the times of the arrivals and the $X[n]$ would be the
amplitudes of the jumps. Then $X(t)$ would indicate the total amplitude up to time $t$. If all
the $X[n]$'s were 1, we would have a counting process in that $X(t)$ would count the arrivals
prior to time $t$.
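
The jump-process construction can be sketched numerically. The Python code below is a minimal illustration under stated assumptions: the arrival times $T[n]$ are built from i.i.d. exponential interarrival times with an arbitrary rate `lam`, and the helper name `jump_process` is hypothetical. With all amplitudes equal to 1 it produces a counting process.

```python
import numpy as np

rng = np.random.default_rng(1)

lam = 2.0            # assumed arrival rate (illustrative)
num_arrivals = 20

# Assumption: T[n] is the running sum of i.i.d. exponential interarrival times.
T = np.cumsum(rng.exponential(1.0 / lam, size=num_arrivals))
amplitudes = np.ones(num_arrivals)   # X[n] = 1 for all n gives a counting process

def jump_process(t, amplitudes, arrival_times):
    """Evaluate X(t) = sum_n X[n] u(t - T[n]) with unit-step pulses p_n = u."""
    t = np.asarray(t, dtype=float)
    # steps[i, n] is True when arrival n has occurred by time t[i]
    # (an arrival exactly at t is counted; this happens with probability zero).
    steps = t[:, None] >= arrival_times[None, :]
    return steps.astype(float) @ amplitudes

t_grid = np.linspace(0.0, T[-1], 500)
N_t = jump_process(t_grid, amplitudes, T)
```

Replacing `amplitudes` by any other random sequence gives the general jump process, whose value at time $t$ is the total amplitude accumulated so far.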


If we sample the random process at $n$ times $t_1$ through $t_n$, we get an $n$-dimensional
random vector. If we know the probability distribution of this vector for all times $t_1$ through
$t_n$, and for all positive $n$, then clearly we know a lot about the random process. If we know
all this information, we say that we have statistically specified (statistically determined)
the random process in a fashion that is analogous to the corresponding case for random
sequences.


Definition 9.1-2 A random process $X(t)$ is statistically specified by its complete set
of $n$th-order CDFs (pdf's or PMFs) for all positive integers $n$, that is, $F_X(x_1, x_2, \ldots, x_n;
t_1, t_2, \ldots, t_n)$ for all $x_1, x_2, \ldots, x_n$ and for all $-\infty < t_1 < t_2 < \ldots < t_n < +\infty$.


The term statistical comes from the fact that this is the limit of the information 
that could be obtained from accumulating relative frequencies of events determined by 
the random process X(t) at all finite collections of time instants. Clearly, this is all we 
could hope to determine by measurements on a process that we wish to model. However, 
the question arises: Is this enough information to completely determine the random process? 
Unfortunately the general answer is no. We need to impose a continuity requirement on the 
sample functions $x(t)$. To see this, the following simple example suffices.

Example 9.1-4
(from Karlin [9-1]) Let $U$ be a uniform random variable on [0,1] and define the random
processes $X(t)$ and $Y(t)$ as follows:

$$X(t) \triangleq \begin{cases} 1 & \text{for } t = U \\ 0 & \text{else,} \end{cases}$$

and

$$Y(t) \triangleq 0 \quad \text{for all } t.$$

Then $Y(t)$ and $X(t)$ will have the same finite-order distributions, yet obviously the
probability of the following two events is not the same:

$$\{X(t) < 0.5 \text{ for all } t\}$$

and

$$\{Y(t) < 0.5 \text{ for all } t\}.$$

To show that $Y(t)$ and $X(t)$ have the same $n$th-order pdf's, find the conditional $n$th-order
pdf of $X$ given $U = u$, then integrate out the conditioning on $U$. We leave this as an exercise
to the reader.
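
A small simulation can make the distinction tangible, although it proves nothing. The Python sketch below uses an arbitrary set of fixed sampling times and an arbitrary number of trials (both assumptions): sampling $X(t)$ at any predetermined finite set of times sees only zeros, so the samples are indistinguishable from those of $Y(t)$, while evaluating at the random time $t = U$ always finds the value 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def X(t, u):
    """Sample function of the process X: equal to 1 at t == u and 0 elsewhere."""
    return np.where(np.asarray(t) == u, 1.0, 0.0)

fixed_times = np.array([0.1, 0.25, 0.5, 0.9])   # arbitrary finite set of sampling times
num_trials = 10_000

hits = 0          # trials in which some fixed sampling time coincides with U
spike_found = 0   # trials in which X equals 1 somewhere on [0, 1]
for _ in range(num_trials):
    u = rng.uniform(0.0, 1.0)
    hits += int(X(fixed_times, u).any())
    spike_found += int(X(u, u) == 1.0)   # evaluating at t = U always finds the spike
```

In this run `hits` stays at zero, consistent with the event $\{U = t_i\}$ having probability zero for each fixed $t_i$, while `spike_found` equals the number of trials.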


The problem in Example 9.1-4 is that the complementary event $\{X(t) \geq 0.5 \text{ for some } t \in [0,1]\}$
involves an uncountable number of random variables. Yet the statistical
determination and the extended additivity Axiom 4 (see Section 8.1) only allow us to
evaluate probabilities corresponding to countable numbers of random variables. In what
follows, we will generally assume that we always have a process "continuous enough" that
the family of finite-order distribution functions suffices to determine the process for all
time.† Such processes are called separable. The random process X(t) of the above example
is obviously not separable.

As in the case of random sequences, the moment functions play an important role in
practical applications. The mean function, denoted by $\mu_X(t)$, is given as

$$\mu_X(t) \triangleq E[X(t)], \quad -\infty < t < +\infty. \tag{9.1-1}$$

Similarly the correlation function is defined as the expected value of the conjugate product,

$$R_{XX}(t_1, t_2) \triangleq E[X(t_1)X^*(t_2)], \quad -\infty < t_1, t_2 < +\infty. \tag{9.1-2}$$


The covariance function is defined as the expected value of the conjugate product of the
centered process $X_c(t) \triangleq X(t) - \mu_X(t)$ at times $t_1$ and $t_2$:

$$\begin{aligned}
K_{XX}(t_1, t_2) &\triangleq E[X_c(t_1)X_c^*(t_2)] \\
&= E[(X(t_1) - \mu_X(t_1))(X(t_2) - \mu_X(t_2))^*].
\end{aligned} \tag{9.1-3}$$


† An exception is white noise, to be introduced in Section 9.3.

Clearly these three functions are not unrelated and in fact we have

$$K_{XX}(t_1, t_2) = R_{XX}(t_1, t_2) - \mu_X(t_1)\mu_X^*(t_2). \tag{9.1-4}$$

We also define the variance function as $\sigma_X^2(t) \triangleq K_{XX}(t,t) = E[|X_c(t)|^2]$, and the power
function $R_{XX}(t,t) = E[|X(t)|^2]$.


Example 9.1-5 
(more on random sinewave) Consider the random process 


$$X(t) = A\sin(\omega_0 t + \Theta),$$

where $A$ and $\Theta$ are independent, real-valued random variables and $\Theta$ is uniformly distributed
over $[-\pi, +\pi]$. For this sinusoidal random process, we will find the mean function $\mu_X(t)$ and
correlation function $R_{XX}(t_1, t_2)$. First

$$\begin{aligned}
\mu_X(t) &= E[A\sin(\omega_0 t + \Theta)] \\
&= E[A]\,E[\sin(\omega_0 t + \Theta)] \\
&= \mu_A \cdot \frac{1}{2\pi}\int_{-\pi}^{+\pi} \sin(\omega_0 t + \theta)\,d\theta \\
&= 0.
\end{aligned}$$

Then for the correlation,

$$\begin{aligned}
R_{XX}(t_1, t_2) &= E[X(t_1)X^*(t_2)] \\
&= E[A^2 \sin(\omega_0 t_1 + \Theta)\sin(\omega_0 t_2 + \Theta)] \\
&= E[A^2]\,E[\sin(\omega_0 t_1 + \Theta)\sin(\omega_0 t_2 + \Theta)].
\end{aligned}$$

Now, the second factor can be rewritten as

$$\tfrac{1}{2}\{E[\cos(\omega_0(t_1 - t_2))] - E[\cos(\omega_0(t_1 + t_2) + 2\Theta)]\} \tag{9.1-5}$$

by applying the trigonometric identity

$$\sin(B)\sin(C) = \tfrac{1}{2}\{\cos(B - C) - \cos(B + C)\},$$

and bringing the expectation operator inside. Then, since $\Theta$ is uniformly distributed over
$[-\pi, +\pi]$, the integral arising from the second expectation in Equation 9.1-5 is zero, and we
finally obtain

$$R_{XX}(t_1, t_2) = \tfrac{1}{2}E[A^2]\cos\omega_0(t_1 - t_2).$$

We note that $\mu_X(t) = 0$ (a constant) and $R_{XX}(t_1, t_2)$ depends only on $t_1 - t_2$. Such processes
will be classified as wide-sense stationary in Section 9.4.
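The result of Example 9.1-5 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original example: it assumes, purely for illustration, that $A$ is uniform on [0, 2] (so that $E[A^2] = 4/3$) and that $\omega_0 = 2$; the function names are hypothetical.

```python
import math
import random


def estimate_sinewave_moments(t1, t2, omega0=2.0, trials=100_000, seed=0):
    """Monte Carlo estimates of E[X(t1)] and E[X(t1)X(t2)]
    for X(t) = A sin(omega0 * t + Theta)."""
    rng = random.Random(seed)
    mean_sum = 0.0
    corr_sum = 0.0
    for _ in range(trials):
        a = rng.uniform(0.0, 2.0)               # assumed amplitude law, E[A^2] = 4/3
        theta = rng.uniform(-math.pi, math.pi)  # phase uniform on [-pi, +pi]
        x1 = a * math.sin(omega0 * t1 + theta)
        x2 = a * math.sin(omega0 * t2 + theta)
        mean_sum += x1
        corr_sum += x1 * x2
    return mean_sum / trials, corr_sum / trials


def theoretical_correlation(t1, t2, omega0=2.0, second_moment=4.0 / 3.0):
    """R_XX(t1, t2) = (1/2) E[A^2] cos(omega0 (t1 - t2))."""
    return 0.5 * second_moment * math.cos(omega0 * (t1 - t2))
```

With these assumptions the estimated mean should be close to zero, and the estimated correlation should agree with `theoretical_correlation` to within the Monte Carlo error for any pair of times with the same difference.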
As in the discrete-time case, the correlation and covariance functions are Hermitian
symmetric, that is,

$$R_{XX}(t_1, t_2) = R_{XX}^*(t_2, t_1),$$
$$K_{XX}(t_1, t_2) = K_{XX}^*(t_2, t_1),$$

which directly follow from the linearity of the expectation operator E.

If we sample the random process at N times $t_1, t_2, \ldots, t_N$, we form a random vector.
We have already seen that the correlation or covariance matrix of a random vector must 
be positive semidefinite (cf. Chapter 5). This, then, imposes certain requirements on the 
respective correlation and covariance function of the random process. Specifically, every 
correlation (covariance) matrix that can be formed from a correlation (covariance) function 
must be positive semidefinite. We next define positive semidefinite functions. 


Definition 9.1-3 The two-dimensional function g(t, s) is positive semidefinite if for
all $N > 0$, and all $t_1 < t_2 < \ldots < t_N$, and for all complex constants $a_1, a_2, \ldots, a_N$, we have

$$\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j^*\, g(t_i, t_j) \geq 0.$$


Using this definition, we can thus say that all correlation and covariance functions must 
be positive semidefinite. Later we will see that this necessary condition is also sufficient. 
Although positive semidefiniteness is an important constraint, it is difficult to apply this 
condition in a test of the legitimacy of a proposed correlation function. 
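The necessity half of this statement can be seen in one line. As a sketch, assuming only that the second moments $E[|X(t_i)|^2]$ are finite, substitute $R_{XX}(t_i, t_j) = E[X(t_i)X^*(t_j)]$ from Equation 9.1-2 into the double sum of Definition 9.1-3:

```latex
\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j^* R_{XX}(t_i, t_j)
  = E\left[\Big(\sum_{i=1}^{N} a_i X(t_i)\Big)\Big(\sum_{j=1}^{N} a_j X(t_j)\Big)^{*}\right]
  = E\left[\,\Big|\sum_{i=1}^{N} a_i X(t_i)\Big|^{2}\right] \geq 0 .
```

The same argument applied to the centered process $X(t) - \mu_X(t)$ gives the result for the covariance function.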

Another fundamental property of correlation and covariance functions is diagonal 
dominance, 


$$|R_{XX}(t, s)| \leq \sqrt{R_{XX}(t,t)R_{XX}(s,s)} \quad \text{for all } t, s,$$


which follows from the Cauchy—Schwarz inequality (cf. Equation 4.3-17). Diagonal domi- 
nance is implied by positive semidefiniteness but is a much weaker condition. 


9.2 SOME IMPORTANT RANDOM PROCESSES 


In this section we introduce several important random processes. We start with the asyn- 
chronous binary signaling (ABS) process and the random telegraph signal (RTS). We 
continue with the Poisson counting process; the phase-shift keying (PSK) random process, 
an example of digital modulation; the Wiener process, which is obtained as a contin- 
uous limit of a random walk sequence; and lastly introduce the broad class of Markov 
processes. 


Asynchronous Binary Signaling 


A sample function of the asynchronous binary signaling (ABS) process (important for digital
modulation and computers) is shown in Figure 9.2-1. Each pulse has width $T$, with the
random variable $X_n$ indicating the height of the nth pulse and taking on the values $\pm a$ with equal
probability.

Figure 9.2-1 Sample function realization of the asynchronous binary signaling (ABS) process. (Plotted
for D = 0.)

The sequence is asynchronous because the start time of the nth pulse or, equivalently,
the displacement $D$ of the 0th pulse is a uniform random variable $U(-T/2, T/2)$. For $|t_2 - t_1| < T$,
the sampling instant $t_2$ could be on the same pulse containing the sampling instant $t_1$
or on a different pulse.

The ABS process can thus be described by

$$X(t) = \sum_{n} X_n\, w\!\left(\frac{t - D - nT}{T}\right),$$

where the pulse (rectangular window) function $w(t)$ is defined as

$$w(t) \triangleq \begin{cases} 1 & \text{for } |t| < \tfrac{1}{2} \\ 0 & \text{else.} \end{cases}$$

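The correlation function for this real-valued process is given as

$$\begin{aligned}
R_{XX}(t_1, t_2) &= E[X(t_1)X(t_2)] \\
&= E\left[\sum_{n}\sum_{l} X_n X_l\, w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - lT}{T}\right)\right].
\end{aligned}$$

In the ABS process it is assumed the levels of different pulses are independent random
variables and that these, in turn, are independent of the random displacement D. Since
$E[X_n X_l] = E[X_n]E[X_l]$ for $n \neq l$ and $E[X_n^2] = a^2$, we obtain

$$\begin{aligned}
R_{XX}(t_1, t_2) &= a^2 \sum_{n} E\left[w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - nT}{T}\right)\right] \\
&\quad + \sum_{n \neq l} E[X_n]E[X_l]\, E\left[w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - lT}{T}\right)\right].
\end{aligned}$$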
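Now, the second term on the right, the one involving the $n \neq l$ products, is zero because
$E[X_n] = E[X_l] = 0$. Also

$$\begin{aligned}
a^2 \sum_{n} E\left[w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - nT}{T}\right)\right]
&= \frac{a^2}{T} \sum_{n} \int_{-T/2}^{+T/2} w\!\left(\frac{t_1 - \delta - nT}{T}\right) w\!\left(\frac{t_2 - \delta - nT}{T}\right) d\delta \\
&= a^2\left(1 - \frac{t_2 - t_1}{T}\right) w\!\left(\frac{t_2 - t_1}{2T}\right) \quad \text{for } t_2 \geq t_1.
\end{aligned}$$

More generally, for $\tau \triangleq t_1 - t_2$ of either sign, we can write that

$$R_{XX}(\tau) = a^2\left(1 - \frac{|\tau|}{T}\right) w\!\left(\frac{\tau}{2T}\right), \tag{9.2-1}$$

since $w(|\tau|) = w(\tau)$.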


Equation 9.2-1 is directly extended to the case of equiprobable transitions between two
arbitrary levels, say a and b. The required modification is

$$R_{XX}(\tau) = \left(\frac{a - b}{2}\right)^2\left(1 - \frac{|\tau|}{T}\right) w\!\left(\frac{\tau}{2T}\right) + \left(\frac{a + b}{2}\right)^2.$$

We leave the derivation of this result as an exercise for the reader. In Figure 9.2-2 we show
the ABS correlation function $R_{XX}(\tau)$ for a = 1, b = 0, and T = 1.
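The two-level formula can be compared with a direct simulation. The sketch below is illustrative only and its function names are hypothetical: it draws the displacement $D$ uniformly on $(-T/2, T/2)$, finds which pulse covers each sampling instant, and averages the product of the two samples. Because the result depends only on the time difference, the first sampling instant `t1` is an arbitrary choice.

```python
import math
import random


def estimate_abs_correlation(tau, a=1.0, b=0.0, T=1.0, t1=0.0,
                             trials=100_000, seed=1):
    """Monte Carlo estimate of E[X(t1) X(t1 + tau)] for the ABS process
    with equiprobable pulse levels a and b."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        d = rng.uniform(-T / 2, T / 2)
        # Pulse n is centred at d + n*T and covers |t - d - n*T| < T/2.
        n1 = math.floor((t1 - d) / T + 0.5)
        n2 = math.floor((t1 + tau - d) / T + 0.5)
        x1 = rng.choice((a, b))
        # Same pulse: same level. Different pulses: independent levels.
        x2 = x1 if n2 == n1 else rng.choice((a, b))
        total += x1 * x2
    return total / trials


def abs_correlation(tau, a=1.0, b=0.0, T=1.0):
    """Closed form: ((a-b)/2)^2 (1 - |tau|/T) w(tau/(2T)) + ((a+b)/2)^2."""
    triangle = 1.0 - abs(tau) / T if abs(tau) < T else 0.0
    return ((a - b) / 2) ** 2 * triangle + ((a + b) / 2) ** 2
```

Setting `b = -a` recovers the symmetric case of Equation 9.2-1, where the constant term vanishes.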


Poisson Counting Process 


Let the process N(t) represent the total number of counts (arrivals) up to time t. Then we
can write

$$N(t) \triangleq \sum_{n=1}^{\infty} u(t - T[n]),$$

where u(t) is the unit-step function and T[n], the time to the nth arrival, is the random
sequence of times considered in Example 8.1-11. There we showed that the T[n] obeyed the
nonstationary first-order Erlang density,

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!}\,\lambda e^{-\lambda t}u(t), \quad n > 0, \tag{9.2-2}$$

which was obtained as an n-fold convolution of exponential pdf's. A typical sample function
is shown in Figure 9.2-3, where $T[n] = t_n$ and $\tau[n] = \tau_n$. Note that the times between the
arrivals,

$$\tau[n] \triangleq T[n] - T[n-1],$$

the interarrival times, are jointly independent and identically distributed, having the
exponential pdf,

$$f_\tau(t) = \lambda e^{-\lambda t}u(t),$$

as in Example 8.1-11. Thus, T[n] denotes the total time until the nth arrival if we begin
counting at the reference time t = 0.

Figure 9.2-2 Autocorrelation function of ABS random process for a = 1, b = 0 and T = 1.

Figure 9.2-3 A sample function of the Poisson process running on $[0, \infty)$.


Now by the construction involving the unit-step function, the value N(t) is the number
of arrivals up to and including time t, so

$$P[N(t) = n] = P[T[n] \leq t,\ T[n+1] > t],$$

because the only way that N(t) can equal n is if the random variable T[n] is less than or
equal to t and the random variable T[n + 1] is greater than t. If we bring in the independent
interarrival times, we can re-express this probability as

$$P[T[n] \leq t,\ \tau[n+1] > t - T[n]],$$
which can be easily calculated using the statistical independence of the arrival time T[n]
and the interarrival time $\tau[n+1]$ as follows:

$$\int_0^t f_T(\alpha; n)\left[\int_{t-\alpha}^{\infty} f_\tau(\beta)\,d\beta\right] d\alpha
= \left(\int_0^t \frac{\lambda^n \alpha^{n-1}}{(n-1)!}\,e^{-\lambda\alpha}\,e^{-\lambda(t-\alpha)}\,d\alpha\right) u(t),$$

so that

$$P_N(n; t) = \frac{(\lambda t)^n}{n!}\,e^{-\lambda t}u(t) \quad \text{for } t \geq 0,\ n \geq 0. \tag{9.2-3}$$

We have thus arrived at the PMF of the Poisson counting process, and we note that it is
equal to that of a Poisson random variable (cf. Equation 2.5-13, see also Equation 1.10-5)
with mean $\mu = \lambda t$; that is,

$$E[N(t)] = \lambda t. \tag{9.2-4}$$


We call $\lambda$ the mean arrival rate (also sometimes called intensity). It is intuitively satisfying
that the average value of the process at time t is the mean arrival rate $\lambda$ multiplied by the
length of the time interval (0, t]. We leave it as an exercise for the reader to consider why
this is so.
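The construction of N(t) from exponential interarrival times translates directly into a simulation. The following sketch (hypothetical function names, standard library only) counts the arrivals in (0, t] along one sample path and compares the empirical frequency of the event {N(t) = n} with Equation 9.2-3.

```python
import math
import random


def poisson_count(t, rate, rng):
    """N(t): number of arrivals in (0, t], built from i.i.d. exponential
    interarrival times with parameter `rate`."""
    count = 0
    arrival = rng.expovariate(rate)      # T[1]
    while arrival <= t:
        count += 1
        arrival += rng.expovariate(rate)  # T[n+1] = T[n] + tau[n+1]
    return count


def poisson_pmf(n, t, rate):
    """Equation 9.2-3: ((rate*t)^n / n!) exp(-rate*t)."""
    mu = rate * t
    return mu ** n / math.factorial(n) * math.exp(-mu)


def empirical_pmf(n, t, rate, trials=50_000, seed=2):
    """Fraction of simulated sample paths with exactly n arrivals in (0, t]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if poisson_count(t, rate, rng) == n)
    return hits / trials
```

For example, with rate 2 and t = 1.5 the mean count is 3, and `empirical_pmf(3, 1.5, 2.0)` should lie close to `poisson_pmf(3, 1.5, 2.0)`.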

Since the random sequence T[n] has independent increments (cf. Definition 8.1-4) and
the unit-step function used in the definition of the Poisson process is causal, it seems
reasonable that the Poisson process N(t) would also have independent increments. However, this
result is not clear because one of the jointly independent interarrival times $\tau[n]$ may be
partially in two disjoint intervals, hence causing a dependency in neighboring increments.
Nevertheless, using the memoryless property of the exponential pdf (see Problem 9.8), one
can show that the independent-increments property does hold for the Poisson process.

Using independent increments we can evaluate the PMF of the increment in the Poisson
counting process over an interval $(t_a, t_b]$ as

$$P[N(t_b) - N(t_a) = n] = \frac{[\lambda(t_b - t_a)]^n}{n!}\,e^{-\lambda(t_b - t_a)}u(n), \tag{9.2-5}$$

where we have used the fact that the interarrival sequence is stationary, that is, that $\lambda$ is a
constant. We formalize this somewhat in the following definition.


Definition 9.2-1 A random process has independent increments when the set of n
random variables,

$$X(t_1),\ X(t_2) - X(t_1),\ \ldots,\ X(t_n) - X(t_{n-1}),$$

are jointly independent for all $t_1 < t_2 < \ldots < t_n$ and for all $n \geq 1$.


This just says that the increments are statistically independent when the corresponding 
intervals do not overlap. Just as in the random sequence case, the independent-increment 
property makes it easy to get the higher-order distributions. For example, in the case at
hand, the Poisson counting process, we can write for $t_2 > t_1$,

$$\begin{aligned}
P_N(n_1, n_2; t_1, t_2) &= P[N(t_1) = n_1]\,P[N(t_2) - N(t_1) = n_2 - n_1] \\
&= \frac{(\lambda t_1)^{n_1}}{n_1!}\,e^{-\lambda t_1}\,\frac{[\lambda(t_2 - t_1)]^{n_2 - n_1}}{(n_2 - n_1)!}\,e^{-\lambda(t_2 - t_1)}\,u(n_1)u(n_2 - n_1),
\end{aligned}$$

which simplifies to

$$P_N(n_1, n_2; t_1, t_2) = \frac{\lambda^{n_2}\,t_1^{n_1}(t_2 - t_1)^{n_2 - n_1}\,e^{-\lambda t_2}}{n_1!\,(n_2 - n_1)!}\,u(n_1)u(n_2 - n_1), \quad 0 \leq t_1 < t_2.$$

See also Problem 1.54. Using the independent-increments property we can formulate the
following alternative definition of a Poisson counting process.


Definition 9.2-2 A Poisson counting process is the independent-increments process
whose increments are Poisson distributed as in Equation 9.2-5.


Concerning the moment function of the Poisson process, the first-order moment has
been shown to be $\lambda t$. This is the mean function of the process. Letting $t_2 \geq t_1$, we can
calculate the correlation function using the independent-increments property as

$$\begin{aligned}
E[N(t_2)N(t_1)] &= E[(N(t_1) + [N(t_2) - N(t_1)])N(t_1)] \\
&= E[N^2(t_1)] + E[N(t_2) - N(t_1)]\,E[N(t_1)] \\
&= \lambda t_1 + \lambda^2 t_1^2 + \lambda(t_2 - t_1)\lambda t_1 \\
&= \lambda t_1 + \lambda^2 t_1 t_2.
\end{aligned}$$

If $t_2 < t_1$, we merely interchange $t_1$ and $t_2$ in the preceding formula. Thus the general result
for all $t_1$ and $t_2$ is

$$\begin{aligned}
R_{NN}(t_1, t_2) &= E[N(t_1)N(t_2)] \\
&= \lambda\min(t_1, t_2) + \lambda^2 t_1 t_2.
\end{aligned} \tag{9.2-6}$$

If we evaluate the covariance using Equations 9.2-4 and 9.2-6 we obtain

$$K_{NN}(t_1, t_2) = \lambda\min(t_1, t_2). \tag{9.2-7}$$

We thus see that the variance of the process is equal to At and is the same as its mean, 
a property inherited from the Poisson random variable. Also we see that the covariance 
depends only on the earlier of the two times involved. The reason for this is seen by writing 
N(t) as the value at an earlier time plus an increment, and then noting that the independence 
of this increment and N(t) at the earlier time implies that the covariance between them must 
be zero. Thus, the covariance of this independent-increments process is just the variance of 
the process at the earlier of the two times. 


Example 9.2-1
(radioactivity monitor) In radioactivity monitoring, the particle-counting process can often
be adequately modeled as Poisson. Let the counter start to monitor at some arbitrary time t
and then count for $T_0$ seconds. If the count is above a threshold, say $N_0$, an alarm will be
sounded. Assuming the arrival rate to be $\lambda$, we want to know the probability that the alarm
will not sound when radioactive material is present.

Since the process is Poisson, we know it has independent increments that satisfy the
Poisson distribution. Thus the count $\Delta N$ in the interval $(t, t + T_0]$, that is,
$\Delta N \triangleq N(t + T_0) - N(t)$, is Poisson distributed with mean $\lambda T_0$ independent of t. The probability
of $N_0$ or fewer counts is thus

$$P[\Delta N \leq N_0] = \sum_{k=0}^{N_0} \frac{(\lambda T_0)^k}{k!}\,e^{-\lambda T_0}.$$

If $N_0$ is small we can calculate the sum directly. If $\lambda T_0 \gg 1$, we can use the Gaussian
approximation (Equation 1.11-9) to the Poisson distribution.
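For small $N_0$ the sum in Example 9.2-1 is easy to evaluate directly. A minimal sketch follows (the function name and argument names are ours, not from the text); it updates each term recursively so that no large factorials are formed.

```python
import math


def prob_no_alarm(threshold, rate, window):
    """P[Delta N <= N0] for Delta N Poisson with mean rate * window,
    i.e. the probability that the count stays at or below the threshold."""
    mu = rate * window
    term = math.exp(-mu)   # k = 0 term of the sum
    total = term
    for k in range(1, threshold + 1):
        term *= mu / k     # (mu^k / k!) e^{-mu} from the previous term
        total += term
    return total
```

For instance, with $\lambda T_0 = 3$ and $N_0 = 2$ the sum is $e^{-3}(1 + 3 + 4.5)$, which is what `prob_no_alarm(2, 2.0, 1.5)` evaluates.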


Example 9.2-2 
(sum of two independent Poisson processes) Let $N_1(t)$ be a Poisson counting process with
rate $\lambda_1$. Let $N_2(t)$ be a second Poisson counting process with rate $\lambda_2$, where $N_2$ is
independent of $N_1$. The sum of the two processes, $N(t) \triangleq N_1(t) + N_2(t)$, could model the
total number of failures of two separate machines, whose failure rates are $\lambda_1$ and $\lambda_2$,
respectively. It is a remarkable fact that N(t) is also a Poisson counting process with rate
$\lambda = \lambda_1 + \lambda_2$.

To see this we use Definition 9.2-2 of the Poisson counting process and verify these
conditions for N(t). First, it is clear with a little reflection that the sum of two
independent-increments processes will also be an independent-increments process if the processes are
jointly independent. Second, for any increment $N(t_b) - N(t_a)$ with $t_b > t_a$, we can
write

$$N(t_b) - N(t_a) = N_1(t_b) - N_1(t_a) + N_2(t_b) - N_2(t_a).$$

Thus the increment in N is the sum of two corresponding increments in $N_1$ and $N_2$. The
desired result then follows from the fact that the sum of two independent Poisson random
variables is also Poisson distributed with parameter equal to the sum of the two parameters
(cf. Example 3.3-8). Thus the parameter of the increment in N(t) is

$$\lambda_1(t_b - t_a) + \lambda_2(t_b - t_a) = (\lambda_1 + \lambda_2)(t_b - t_a)$$

as desired.


The Poisson counting process N(t) can be generalized in several ways. We can let the
arrival rate be a function of time. The arrival rate $\lambda(t)$ must satisfy $\lambda(t) \geq 0$. The average
value of the resulting nonuniform Poisson counting process then becomes

$$\mu_N(t) = \int_0^t \lambda(\tau)\,d\tau, \quad t \geq 0. \tag{9.2-8}$$

The increments then become independent Poisson distributed with increment means
determined by this time-varying mean function. Another possible generalization is to two-dimensional
or spatial Poisson processes that are used to model photon arrival at an image sensor, defects
on semiconductor wafers, etc.




Alternative Derivation of Poisson Process 


It may be interesting to rederive the Poisson counting process from the elementary properties
of random points in time listed in Chapter 1, Section 1.10. They are repeated here in a
notation consistent with that used in this chapter. For $\Delta t$ small:

(1) $P_N(1; t, t + \Delta t) = \lambda(t)\Delta t + o(\Delta t)$.
(2) $P_N(k; t, t + \Delta t) = o(\Delta t)$, $k > 1$.
(3) Events in nonoverlapping time intervals are statistically independent.

Here the notation $o(\Delta t)$, read "little oh," denotes any quantity that goes to zero at a
faster than linear rate in such a way that

$$\lim_{\Delta t \to 0} \frac{o(\Delta t)}{\Delta t} = 0,$$

and $P_N(k; t, t + \Delta t) \triangleq P[N(t + \Delta t) - N(t) = k]$.

We note that property (3) is just the independent-increments property for the counting
process N(t) which counts the number of events occurring in (0, t].

We can compute the probability $P_N(k; t, t + \tau)$ of k events in $(t, t + \tau)$ as follows. Consider
$P_N(k; t, t + \tau + \Delta t)$; if $\Delta t$ is very small, then in view of properties (1) and (2) there are only
the following two possibilities for getting k events in $(t, t + \tau + \Delta t)$:

$$E_1 = \{k \text{ in } (t, t + \tau) \text{ and } 0 \text{ in } (t + \tau, t + \tau + \Delta t)\}$$

or

$$E_2 = \{k - 1 \text{ in } (t, t + \tau) \text{ and } 1 \text{ in } (t + \tau, t + \tau + \Delta t)\}.$$

Since events $E_1$ and $E_2$ are disjoint events, their probabilities add and we can write

$$\begin{aligned}
P_N(k; t, t + \tau + \Delta t) &= P_N(k; t, t + \tau)\,P_N(0; t + \tau, t + \tau + \Delta t) \\
&\quad + P_N(k - 1; t, t + \tau)\,P_N(1; t + \tau, t + \tau + \Delta t) \\
&= P_N(k; t, t + \tau)[1 - \lambda(t + \tau)\Delta t] \\
&\quad + P_N(k - 1; t, t + \tau)\,\lambda(t + \tau)\Delta t.
\end{aligned}$$

If we rearrange terms, divide by $\Delta t$, and take limits, we obtain the linear differential
equations (LDEs),

$$\frac{dP_N(k; t, t + \tau)}{d\tau} = \lambda(t + \tau)[P_N(k - 1; t, t + \tau) - P_N(k; t, t + \tau)].$$

Thus, we obtain a set of recursive first-order differential equations from which we can solve
for $P_N(k; t, t + \tau)$, $k = 0, 1, \ldots$. We set $P_N(-1; t, t + \tau) = 0$, since this is the probability of the
impossible event. Also, to shorten our notation, we temporarily write $P_N(k) \triangleq P_N(k; t, t + \tau)$;
thus the dependences on t and $\tau$ are submerged but of course are still there.

When k = 0,

$$\frac{dP_N(0)}{d\tau} = -\lambda(t + \tau)P_N(0).$$
This is a simple first-order, homogeneous differential equation for which the solution is

$$P_N(0) = C\exp\left[-\int_t^{t+\tau} \lambda(\xi)\,d\xi\right].$$

Since $P_N(0; t, t) = 1$, $C = 1$ and

$$P_N(0) = \exp\left[-\int_t^{t+\tau} \lambda(\xi)\,d\xi\right].$$

Let us define $\mu$ by

$$\mu \triangleq \int_t^{t+\tau} \lambda(\xi)\,d\xi.$$

Then

$$P_N(0) = e^{-\mu}.$$

When k = 1, the differential equation is now

$$\begin{aligned}
\frac{dP_N(1)}{d\tau} + \lambda(t + \tau)P_N(1) &= \lambda(t + \tau)P_N(0) \\
&= \lambda(t + \tau)e^{-\mu}.
\end{aligned} \tag{9.2-9}$$

This elementary first-order, inhomogeneous equation has a solution that is the sum of the
homogeneous and particular solutions. For the homogeneous solution, $P_h$, we already know
from the k = 0 case that

$$P_h = C_2 e^{-\mu}.$$

For the particular solution $P_p$ we use the method of variation of parameters to assume that

$$P_p = v(t + \tau)e^{-\mu},$$

where $v(t + \tau)$ is to be determined. By substituting this equation into Equation 9.2-9 we
readily find that

$$P_p = \mu e^{-\mu}.$$

The complete solution is $P_N(1) = P_h + P_p$. Since $P_N(1; t, t) = 0$, we obtain $C_2 = 0$ and thus

$$P_N(1) = \mu e^{-\mu}.$$

General case. The LDE in the general case is

$$\frac{dP_N(k)}{d\tau} + \lambda(t + \tau)P_N(k) = \lambda(t + \tau)P_N(k - 1),$$

and, proceeding by induction, we find that

$$P_N(k) = \frac{\mu^k}{k!}\,e^{-\mu}, \quad k = 0, 1, \ldots,$$

which is the key result. Recalling the definition of $\mu$, we can write

$$P_N(k; t, t + \tau) = \frac{1}{k!}\left[\int_t^{t+\tau} \lambda(\xi)\,d\xi\right]^k \exp\left[-\int_t^{t+\tau} \lambda(\xi)\,d\xi\right]. \tag{9.2-10}$$

We thus obtain the nonuniform Poisson counting process.
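Equation 9.2-10 is straightforward to evaluate once the integrated rate $\mu$ is known. The sketch below is one possible illustration, with hypothetical function names: it approximates $\mu$ by the trapezoidal rule (a numerical choice of ours, not from the text) for a user-supplied rate function and then applies the Poisson formula.

```python
import math


def integrated_rate(rate_fn, t, tau, steps=10_000):
    """Trapezoidal approximation of mu = integral of rate_fn over [t, t + tau]."""
    h = tau / steps
    total = 0.5 * (rate_fn(t) + rate_fn(t + tau))
    for i in range(1, steps):
        total += rate_fn(t + i * h)
    return total * h


def nonuniform_poisson_pmf(k, rate_fn, t, tau):
    """Equation 9.2-10: (mu^k / k!) exp(-mu) with mu the integrated rate."""
    mu = integrated_rate(rate_fn, t, tau)
    return mu ** k / math.factorial(k) * math.exp(-mu)
```

As a check, the assumed rate $\lambda(\xi) = 2\xi$ on the interval (1, 2] gives $\mu = 3$, and a constant rate reduces the result to the uniform case of Equation 9.2-5.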
Figure 9.2-4 Sample function of the random telegraph signal.


Another way to generalize the Poisson process is to use a different pdf for the indepen- 
dent interarrival times. With a nonexponential density, the more general process is called a 
renewal process [9-2]. The word "renewal" can be related to the interpretation of the arrival
times as the failure times of certain equipment; thus the value of the counting process N(t) 
models the number of renewals that have had to be made up to the present time. 


Random Telegraph Signal 


When all the information in a random waveform is contained in the zero crossings, a so- 
called “hard clipper” is often used to generate a simpler yet equivalent two-level waveform 
that is free of unwanted random amplitude variation. A special case is when the number of 
zero crossings in a time interval follows the Poisson law, and the resulting random process 
is called the random telegraph signal (RTS). A sample function of the RTS is shown in 
Figure 9.2-4. 

We construct the RTS on $t \geq 0$ as follows: Let $X(0) = \pm a$ with equal probability.
Then take the Poisson arrival time sequence T[n] of Chapter 8 and use it to switch the
level of the RTS; that is, at T[1] switch the sign of X(t), and then at T[2], and so forth.
Clearly from the symmetry and the fact that the interarrival times $\tau[n]$ are stationary and
form an independent random sequence, we must have that $\mu_X(t) = 0$ and that the first-order
PMF $P_X(a) = P_X(-a) = 1/2$. Next let $t_2 > t_1 > 0$, and consider the second-order
PMF $P_X(x_1, x_2) \triangleq P[X(t_1) = x_1, X(t_2) = x_2]$ along with
$P_X(x_2 | x_1) \triangleq P[X(t_2) = x_2 \mid X(t_1) = x_1]$. Then we can write the correlation function as

$$\begin{aligned}
R_{XX}(t_1, t_2) &= E[X(t_1)X(t_2)] \\
&= a^2 P_X(a, a) + (-a)^2 P_X(-a, -a) + a(-a)P_X(a, -a) - a(a)P_X(-a, a) \\
&= \tfrac{1}{2}a^2\left(P_X(a|a) + P_X(-a|-a) - P_X(-a|a) - P_X(a|-a)\right),
\end{aligned}$$

since $P_X(a) = P_X(-a) = 1/2$. But $P_X(-a|-a) = P_X(a|a)$ is just the probability of an
even number of zero crossings in the time interval $(t_1, t_2]$, while $P_X(-a|a) = P_X(a|-a)$ is
the probability of an odd number of crossings of 0. Hence, writing the average number of
transitions per unit time as $\lambda$, and substituting $\tau = t_2 - t_1$, we get
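$$\begin{aligned}
R_{XX}(t_1, t_2) &= a^2 \sum_{\text{even } k \geq 0} e^{-\lambda\tau}\frac{(\lambda\tau)^k}{k!} - a^2 \sum_{\text{odd } k \geq 0} e^{-\lambda\tau}\frac{(\lambda\tau)^k}{k!} \\
&= a^2 e^{-\lambda\tau} \sum_{\text{all } k \geq 0} (-1)^k \frac{(\lambda\tau)^k}{k!},
\end{aligned}$$

where we have combined the two sums by making use of the function $(-1)^k$, since $(-1)^k = 1$
for k even and $(-1)^k = -1$ for k odd. Thus we now have

$$R_{XX}(t_1, t_2) = a^2 e^{-\lambda\tau} \sum_{\text{all } k \geq 0} \frac{(-\lambda\tau)^k}{k!} = a^2 e^{-2\lambda\tau}$$

for the case when $\tau > 0$. Since the correlation function of a real-valued process must be
symmetric, we have $R_{XX}(t_1, t_2) = R_{XX}(t_2, t_1)$, so that when $\tau < 0$, we can substitute $-\tau$
into the above equation to get $R_{XX}(t_1, t_2) = a^2 e^{+2\lambda\tau}$. Thus overall we have, valid for all
interval lengths $\tau$,

$$R_{XX}(t_1, t_2) = a^2 e^{-2\lambda|\tau|}.$$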


A plot of this correlation function is shown in Figure 9.2-5. 
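The exponential form of the RTS correlation can be checked by simulation. In the sketch below (hypothetical function names), the product $X(t_1)X(t_1 + \tau)$ equals $a^2$ times $(-1)$ raised to the number of switches in the interval; by the memoryless property of the exponential pdf, the switch times can be generated starting from $t_1$. The default values match those quoted for Figure 9.2-5.

```python
import math
import random


def rts_product(tau, rate, a, rng):
    """X(t1) * X(t1 + tau) for one RTS sample path:
    a^2 times (-1)^(number of Poisson switches in an interval of length tau)."""
    switches = 0
    elapsed = rng.expovariate(rate)
    while elapsed <= tau:
        switches += 1
        elapsed += rng.expovariate(rate)
    return a * a * (-1) ** switches


def estimate_rts_correlation(tau, rate=0.25, a=2.0, trials=100_000, seed=3):
    """Monte Carlo estimate of R_XX for time difference tau (either sign)."""
    rng = random.Random(seed)
    tau = abs(tau)  # the correlation is symmetric in tau
    return sum(rts_product(tau, rate, a, rng) for _ in range(trials)) / trials


def rts_correlation(tau, rate=0.25, a=2.0):
    """Closed form a^2 exp(-2 * rate * |tau|)."""
    return a * a * math.exp(-2.0 * rate * abs(tau))
```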


Digital Modulation Using Phase-Shift Keying 


Digital computers generate many binary sequences (data) to be communicated to other 
digital computers. Often this involves some kind of modulation. Binary modulation methods 
frequency-shift these data to a region of the electromagnetic spectrum which is well suited 
to the transmission media, for example, a telephone line. A basic method for modulating 
binary data is phase-shift keying (PSK). In this method binary data, modeled by the random 
sequence B[n], are mapped bit-by-bit into a phase-angle sequence $\Theta[n]$, which is used to
modulate a carrier signal $\cos(2\pi f_c t)$.

Figure 9.2-5 The symmetric exponential correlation function of an RTS process (a = 2.0, $\lambda$ = 0.25).
Figure 9.2-6 System for PSK modulation of Bernoulli random sequence B[n].

Specifically let B[n] be a Bernoulli random sequence taking on the values 0 and 1 with
equal probability. Then define the random phase sequence $\Theta[n]$ as follows:

$$\Theta[n] \triangleq \begin{cases} +\pi/2 & \text{if } B[n] = 1, \\ -\pi/2 & \text{if } B[n] = 0. \end{cases}$$

Using $\Theta_a(t)$ to denote the analog angle process, we define

$$\Theta_a(t) \triangleq \Theta[k] \quad \text{for } kT \leq t < (k+1)T,$$

and construct the modulated signal as

$$X(t) = \cos(2\pi f_c t + \Theta_a(t)). \tag{9.2-11}$$

Here T is a constant time for the transmission of one bit. Normally, T is chosen to be
a multiple of $1/f_c$ so that there are an integral number of carrier cycles per bit time T.
The reciprocal of T is called the message or baud rate. The overall modulator is shown in
Figure 9.2-6. The process X(t) is the PSK process.

Our goal here is to evaluate the mean function and correlation function of the random
PSK process. To help in the calculation we define two basis functions,

$$s_I(t) \triangleq \begin{cases} \cos(2\pi f_c t) & 0 \leq t < T \\ 0 & \text{else} \end{cases}$$

and

$$s_Q(t) \triangleq \begin{cases} \sin(2\pi f_c t) & 0 \leq t < T \\ 0 & \text{else,} \end{cases}$$

which together with Equation 9.2-11 imply

$$\begin{aligned}
\cos[2\pi f_c t + \Theta_a(t)] &= \cos(\Theta_a(t))\cos 2\pi f_c t - \sin(\Theta_a(t))\sin 2\pi f_c t \\
&= \sum_{k=-\infty}^{+\infty} \cos(\Theta[k])\,s_I(t - kT) - \sum_{k=-\infty}^{+\infty} \sin(\Theta[k])\,s_Q(t - kT),
\end{aligned} \tag{9.2-12}$$

by use of the sum of angles formula for cosines.

The mean of X(t) can then be obtained in terms of the means of the random sequences
$\cos(\Theta[n])$ and $\sin(\Theta[n])$. Because of the definition of $\Theta[n]$, in this particular case
$\cos(\Theta[n]) = 0$ and $\sin(\Theta[n]) = \pm 1$ with equal probability, so that the mean of X(t) is zero,
that is, $\mu_X(t) = 0$.
Using Equation 9.2-12 we can calculate the correlation function

$$R_{XX}(t_1, t_2) = \sum_{k,l} E\{\sin\Theta[k]\sin\Theta[l]\}\,s_Q(t_1 - kT)\,s_Q(t_2 - lT),$$

which involves the correlation function of the random sequence $\sin(\Theta[n])$,

$$R_{\sin\Theta,\sin\Theta}[k, l] = \delta[k - l].$$

Thus the overall correlation function then becomes

$$R_{XX}(t_1, t_2) = \sum_{k=-\infty}^{+\infty} s_Q(t_1 - kT)\,s_Q(t_2 - kT). \tag{9.2-13}$$

Since the support of $s_Q$ is only of width T, there is no overlap in $(t_1, t_2)$ between product
terms in Equation 9.2-13. So for any fixed $(t_1, t_2)$, only one of the product terms in the sum
can be nonzero. Also if $t_1$ and $t_2$ are not in the same period, then this term is zero also.
More elegantly, using the notation

$$(t) \triangleq t \bmod T \quad \text{and} \quad \lfloor t/T \rfloor \triangleq \text{integer part}(t/T),$$

we can write that

$$R_{XX}(t_1, t_2) = \begin{cases} s_Q((t_1))\,s_Q((t_2)) & \text{for } \lfloor t_1/T \rfloor = \lfloor t_2/T \rfloor \\ 0 & \text{else.} \end{cases}$$

In particular for $0 \leq t_1 < T$ and $0 \leq t_2 < T$, we have

$$R_{XX}(t_1, t_2) = s_Q(t_1)\,s_Q(t_2).$$
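The piecewise form of the PSK correlation function can be illustrated with a short sketch. The function names are hypothetical, the bit list stands in for one realization of B[n] on $t \geq 0$, and the carrier frequency is assumed to satisfy the condition that $f_c T$ is an integer.

```python
import math


def psk_signal(t, bits, fc, T):
    """X(t) = cos(2 pi fc t + Theta_a(t)) for t >= 0, where Theta = +pi/2
    for bit 1 and -pi/2 for bit 0 during the bit interval containing t."""
    k = math.floor(t / T)
    theta = math.pi / 2 if bits[k] == 1 else -math.pi / 2
    return math.cos(2 * math.pi * fc * t + theta)


def s_q(t, fc, T):
    """Quadrature basis function: sin(2 pi fc t) on [0, T), zero elsewhere."""
    return math.sin(2 * math.pi * fc * t) if 0 <= t < T else 0.0


def psk_correlation(t1, t2, fc, T):
    """R_XX(t1, t2): s_Q((t1)) s_Q((t2)) when both times fall in the same
    bit interval, zero otherwise."""
    if math.floor(t1 / T) != math.floor(t2 / T):
        return 0.0
    return s_q(t1 % T, fc, T) * s_q(t2 % T, fc, T)
```

When both times lie in the same bit interval, the product $X(t_1)X(t_2)$ does not depend on the bit value at all, so it equals the correlation exactly; when they lie in different intervals, averaging over the four equally likely bit pairs gives zero.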


Wiener Process or Brownian Motion 


In Chapter 8 we considered a random sequence X[n] called the random walk in
Example 8.1-13. Here we construct an analogous random process that is piecewise constant
for intervals of length T as follows:

$$X_T(t) \triangleq \sum_{k=1}^{\infty} W[k]\,u(t - kT),$$

where

$$W[k] \triangleq \begin{cases} +s & \text{with } p = 0.5 \\ -s & \text{with } p = 0.5 \end{cases}$$

and u(t) is the continuous unit step function.
Then $X_T(nT) = X[n]$, the random-walk sequence, since

$$X_T(nT) = \sum_{k=1}^{n} W[k] = X[n].$$

Hence we can evaluate the PMFs and moments of this random process by employing the
known results for the corresponding random-walk sequence. Now the Wiener† process,
sometimes also called Wiener-Levy or Brownian motion, is the process whose distribution
is obtained as a limiting form of the distribution of the above piecewise constant process as
the interval T shrinks to zero. We let s, the jump size, and the interval T shrink to zero in
a precise way to obtain a continuous random process in the limit, that is, a process whose
sample functions are continuous functions of time. In letting s and T tend to zero we must
be careful to make sure that the limit of the variance stays finite and nonzero. The resulting
Wiener process will inherit the independent-increments property.

† After Norbert Wiener, American mathematician (1894-1964), a pioneer in communication and
estimation theories.

The original motivation for the Wiener process was to develop a model for the chaotic 
random motion of gas molecules. Modeling the basic discrete collisions with a random walk, 
one then finds the asymptotic process when an infinite (very large) number of molecules 
interact on an infinitesimal (very small) time scale. 

As in Example 8.1-13, we let n be the number of trials, k be the number of successes,
and n - k be the number of failures. Also $r \triangleq k - (n - k) = 2k - n$ denotes the excess
number of successes over failures. Then $2k = n + r$ or $k = (n + r)/2$ and must be an
integer; you cannot have 2.5 "successes." Thus, n + r must be even and the probability that
$X_T(nT) = rs$ is the probability that there are 0.5(n + r) successes (+s) and 0.5(n - r)
failures (-s) out of a total of n trials. Thus by the binomial PMF,

$$P[X_T(nT) = rs] = \binom{n}{\frac{n+r}{2}}\,2^{-n} \quad \text{for } n + r \text{ even.}$$

If n + r is odd, then $X_T(nT)$ cannot equal rs.
The mean and variance can be most easily calculated by noting that the random variable
X[n] is the sum of n independent Bernoulli random variables defined in Section 8.1. Thus

$$E[X_T(nT)] = 0$$

and

$$E[X_T^2(nT)] = ns^2.$$

On expressing the variance in terms of t = nT, we have

$$\mathrm{Var}[X_T(t)] = E[X_T^2(nT)] = t\,\frac{s^2}{T}.$$

Thus we need $s^2$ proportional to T to get an interesting limiting distribution.† We set
$s^2 = \alpha T$, where $\alpha > 0$. Now as T goes to zero we keep the variance constant at $\alpha t$. Also, by
an elementary application of the Central Limit theorem (cf. Section 4.7), we get a limiting
Gaussian distribution. We take the limiting random process (convergence in the distribution
sense) to be an independent-increments process since all the above random-walk processes
had independent increments for all T, no matter how small. Hence we arrive at the following
specification for the limiting process, which is termed the Wiener process:

$$\mu_X(t) = 0, \quad \mathrm{Var}[X(t)] = \alpha t$$

and

$$f_X(x; t) = \frac{1}{\sqrt{2\pi\alpha t}}\exp\left(-\frac{x^2}{2\alpha t}\right), \quad t > 0. \tag{9.2-14}$$

† The physical implication of having $s^2$ proportional to T is that if we take v = s/T as the speed of
the particle, then the particle speed goes to infinity as the displacement s goes to zero such as to keep the
product of the two constant.

The pdf of the increment $\Delta \triangleq X(t) - X(\tau)$ for all $t > \tau$ is given as

$$f_\Delta(\delta; t - \tau) = \frac{1}{\sqrt{2\pi\alpha(t - \tau)}}\exp\left(-\frac{\delta^2}{2\alpha(t - \tau)}\right), \tag{9.2-15}$$

since

$$E[X(t) - X(\tau)] = E[\Delta] = 0, \tag{9.2-16}$$

and

$$E[(X(t) - X(\tau))^2] = \alpha(t - \tau) \quad \text{for } t > \tau. \tag{9.2-17}$$


Example 9.2-3 
(sample functions) We can use MATLAB to visually investigate the sample functions typical 
of the Wiener process. Since it is a computer simulation, we also can evaluate the effect of 
the limiting sequence occurring as $s = \sqrt{\alpha T}$ approaches 0 for fixed $\alpha > 0$.

We start with a 1000-element vector that is a realization of the Bernoulli random vector 
W with p= 0.5 generated as 


u = rand(1000,1) 
w=0.5 >=u 


The following line then converts the range of w to +s for a prespecified value s: 
w = s*(2*w - 1.0) 


and then we generate a segment of a sample function of X7(nT) = X[n] as elements of the 
random vector 


x = cumsum(w) 


For the numerical experiment let $\alpha = 1.0$ and set T = 0.01 (s = 0.1). Using a computer
variable x with dimension 1000 for T = 0.01, we get the results shown in Figure 9.2-7. Note 
particularly in this near limiting case, the effects of increasing variance with time. Also note 
that trends or long-term waves appear to develop as time progresses. 


From the first-order pdf of X and the density of the increment $\Delta$, it is possible to
calculate a complete set of consistent nth-order pdf's as we have seen before. It thus follows
that all nth-order pdf's of a Wiener process are Gaussian.


Definition 9.2-3 If for all positive integers n, the nth-order pdf's of a random process
are all jointly Gaussian, then the process is called a Gaussian random process.


The Wiener process is thus an example of a Gaussian random process. The covariance
function of the Wiener process (which is also its correlation function because $\mu_X(t) = 0$) is
given as

$$K_{XX}(t_1, t_2) = \alpha\min(t_1, t_2), \quad \alpha > 0. \tag{9.2-18}$$

Figure 9.2-7 A Wiener process sample function approximation for $\alpha$ = 1 calculated with T = 0.01.


To show this we take $t_1 \geq t_2$, and noting that the (forward) increment $X(t_1) - X(t_2)$ is
independent of $X(t_2)$ and that they both have zero mean, we have

$$E[(X(t_1) - X(t_2))X(t_2)] = E[X(t_1) - X(t_2)]\,E[X(t_2)] = 0,$$

so that

$$E[X(t_1)X(t_2)] = E[X^2(t_2)] = \alpha t_2.$$

If $t_2 > t_1$, we get $E[X(t_2)X(t_1)] = \alpha t_1$, thus establishing Equation 9.2-18.
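The covariance formula can also be checked on the approximating random walk, for which it holds exactly at the sampling times nT. The sketch below is a Python counterpart, using only the standard library, of the random-walk construction with $s^2 = \alpha T$; the function names and the number of trials are illustrative choices of ours.

```python
import math
import random


def random_walk_path(alpha, T, steps, rng):
    """One realization of X_T(nT), n = 1..steps, with jumps +s or -s
    (equally likely) and s^2 = alpha * T."""
    s = math.sqrt(alpha * T)
    x = 0.0
    path = []
    for _ in range(steps):
        x += s if rng.random() < 0.5 else -s
        path.append(x)
    return path


def estimate_covariance(t1, t2, alpha=1.0, T=0.01, trials=5_000, seed=4):
    """Monte Carlo estimate of E[X_T(t1) X_T(t2)] for t1, t2 >= T that are
    multiples of T; the mean is zero, so this is also the covariance."""
    rng = random.Random(seed)
    n1, n2 = round(t1 / T), round(t2 / T)
    steps = max(n1, n2)
    total = 0.0
    for _ in range(trials):
        path = random_walk_path(alpha, T, steps, rng)
        total += path[n1 - 1] * path[n2 - 1]  # path[n - 1] holds X_T(nT)
    return total / trials
```

With $\alpha = 1$, the estimate for $t_1 = 0.5$ and $t_2 = 1.0$ should be close to $\alpha\min(t_1, t_2) = 0.5$, and the same value should result when the two times are interchanged.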

Note that the Wiener process has the same variance function as the Poisson process, 
even though the two processes are dramatically different. While the Poisson process consists 
solely of jumps separated by constant values, the Wiener process has no jumps and can in 
fact be proven to be a.s. continuous; that is, the sample functions are continuous with 
probability 1. Later, we will show that the Wiener process is continuous in a weaker mean- 
square sense (specified more precisely in Chapter 10). 


Markov Random Processes 


We have discussed five random processes thus far. Of these, the Wiener and Poisson are 
fundamental in that many other rather general random processes have been shown to be 
obtainable by nonlinear transformations on these two basic processes. In both cases, the 
difficulty of specifying a consistent set of nth-order distributions from processes with 
dependence was overcome by use of the independent-increments property. In fact, this is quite a 
general approach in that we can start out with some arbitrary first-order distribution and 
then specify a distribution for the increment, thereby obtaining a consistent set of nth-order 
distributions that exhibit dependence. 

Another way of going from the first-order probability to a consistent set of nth-order 
probabilities, which has proved quite useful, is the Markov process approach. Here we start 
with a first-order density (or PMF) and a conditional density (or conditional PMF) 


$$f_X(x; t) \quad \text{and} \quad f_X(x_2 | x_1; t_2, t_1), \quad t_2 > t_1,$$ 


and then build up the nth-order pdf $f(x_1, \ldots, x_n; t_1, \ldots, t_n)$ (or PMF) as the product, 


$$f(x_1; t_1)\, f(x_2 | x_1; t_2, t_1) \cdots f(x_n | x_{n-1}; t_n, t_{n-1}).$$ (9.2-19) 


We ask the reader to show that this is a valid nth-order pdf (i.e., that this function is 
nonnegative and integrates to one) whenever the conditional and first-order pdf’s are well 
defined. 
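
For the discrete-valued case, the product construction of Equation 9.2-19 is easy to try out. The sketch below assumes a hypothetical two-state example with one transition PMF reused at every step (the numbers and names are illustrative only). It builds the nth-order PMF as the product, then checks that it is nonnegative, sums to one, and is consistent with the lower-order PMF obtained by summing out the last variable.

```python
from itertools import product

# Hypothetical two-state example (values chosen only for illustration).
p_first = {0: 0.3, 1: 0.7}                  # first-order PMF P(x1; t1)
p_trans = {0: {0: 0.9, 1: 0.1},             # p_trans[a][b] = P(next = b | current = a)
           1: {0: 0.4, 1: 0.6}}

def joint_pmf(xs):
    """n-th order PMF built as the product in Equation 9.2-19."""
    p = p_first[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= p_trans[prev][cur]
    return p

n = 4
states = (0, 1)
values = {xs: joint_pmf(xs) for xs in product(states, repeat=n)}
total = sum(values.values())

# Consistency: summing out x_n must give the (n-1)th-order PMF.
max_gap = max(
    abs(sum(values[xs + (x,)] for x in states) - joint_pmf(xs))
    for xs in product(states, repeat=n - 1)
)
print(total, max_gap)
```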

Conversely, if we start with an arbitrary nth-order pdf and repeatedly use the definition 
of conditional probability we obtain, 


$$f(x_1, \ldots, x_n; t_1, \ldots, t_n) = f(x_1; t_1)\, f(x_2 | x_1; t_2, t_1)\, f(x_3 | x_2, x_1; t_3, t_2, t_1) \times \cdots \times f(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1),$$ (9.2-20) 


which can be made equivalent to Equation 9.2-19 by constraining the conditional densities to 
depend only on the most recent conditioning value. This motivates the following definition 
of a Markov random process. 


Definition 9.2-4 (Markov random process) 


(a) A continuous-valued (first-order) Markov process X(t) satisfies the conditional 
pdf expression 

$$f_X(x_n | x_{n-1}, x_{n-2}, \ldots, x_1; t_n, \ldots, t_1) = f_X(x_n | x_{n-1}; t_n, t_{n-1}),$$ 

for all $x_1, x_2, \ldots, x_n$, for all $t_1 < t_2 < \cdots < t_n$, and for all integers n > 0. 
(b) A discrete-valued (first-order) Markov random process satisfies the conditional PMF 
expression 

$$P_X(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1) = P_X(x_n | x_{n-1}; t_n, t_{n-1}),$$ 

for all $x_1, \ldots, x_n$, for all $t_1 < \cdots < t_n$, and for all integers n > 0. 


The value of the process X(t) at a given time t thus determines the conditional proba- 
bilities for future values of the process. The values of the process are called the states of the 
process, and the conditional probabilities are thought of as transition probabilities between 
the states. If only a finite or countable set of values $x_i$ is allowed, the discrete-valued Markov 
process is called a Markov chain. An example of a Markov chain is the Poisson counting 
process studied earlier. The Wiener process is an example of a continuous-valued Markov 
process. Both these processes are Markov because of their independent-increments property. 
In fact, any independent-increment process is also Markov. To see this note that, for the 
discrete-valued case, for example, 


$$\begin{aligned}
P_X(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1)
&= P[X(t_n) = x_n \mid X(t_{n-1}) = x_{n-1}, \ldots, X(t_1) = x_1] \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1} \mid X(t_{n-1}) = x_{n-1}, \ldots, X(t_1) = x_1] \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1}] \quad \text{by the independent-increments property} \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1} \mid X(t_{n-1}) = x_{n-1}] \quad \text{again by independent increments} \\
&= P[X(t_n) = x_n \mid X(t_{n-1}) = x_{n-1}] \\
&= P_X(x_n | x_{n-1}; t_n, t_{n-1}).
\end{aligned}$$ 


Note, however, that the converse is not true. A Markov random process does not 
necessarily have independent increments. (See Problem 9.17.) 

Markov random processes find application in many areas including signal processing, 
communications, and control systems. Markov chains are used in communications, computer 
networks, and reliability theory. 


Example 9.2-4 
(multiprocessor reliability) Given a computer with two independent processors, we can 
model it as a three-state system: 0—both processors down; 1—exactly one processor up; 
and 2—both processors up. We would like to know the probabilities of these three states. A 
common probabilistic model is that the processors will fail randomly with time-to-failure, 
the failure time, exponentially distributed with some parameter $\lambda > 0$. Once a processor 
fails, the time to service it, the service time, will be assumed to be also exponentially 
distributed with parameter $\mu > 0$. Furthermore, we assume that the processor’s failures and 
servicing are independent; thus we make the failure and service times in our probabilistic 
model jointly independent. 

If we define X(t) as the state of the system at time t, then X is a continuous-time 
Markov chain. We can show this by first showing that the times between state transitions 
of X are exponentially distributed and then invoking the memoryless property of the expo- 
nential distribution (see Problem 9.8). Analyzing the transition times (either failure times 
or service times), we proceed as follows. The transition time for going from state X = 0 to 
X =1 is the minimum of two exponentially distributed service times, which are assumed 
to be independent. By Problem 3.26, this time will be also exponentially distributed with 
parameter $2\mu$. The expected time for this transition will thus be $1/(2\mu) = \frac{1}{2}(1/\mu)$, that 
is, one-half the average time to service a single processor. This is quite reasonable since 
both processors are down in state X = 0 and hence both are being serviced independently 
and simultaneously. The rate parameter for the transition 0 to 1 is thus $2\mu$. The 
transition 1 to 2 awaits one exponential service time at rate $\mu$. Thus its rate is $\mu$. 
Similarly, the state transition 1 to 0 awaits only one failure at rate $\lambda$, while the transition 
2 to 1 awaits the minimum of two exponentially distributed failure times. Thus its rate 
is $2\lambda$. Simultaneous transitions from 0 to 2 and 2 to 0 are of probability 0 and hence are 
ignored. 

Figure 9.2-8 Short-time state-transition diagram with indicated transition probabilities. 

This Markov chain model is summarized in the short-time state-transition diagram of 
Figure 9.2-8. In this diagram the directed branches represent short-time, that is, as $\Delta t \to 0$, 
transition probabilities between the states. The transition times are assumed to be expo- 
nentially distributed with the parameter given by the branch label. These transition times 
might be more properly called intertransition times and are analogous to the interarrival 
times of the Poisson counting process, which are also exponentially distributed. 

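Consider the probability of being in state 2 at $t + \Delta t$, having been in state 1 at time t. 
This requires that the service time $T_s$ lies in the interval $(t, t + \Delta t]$ conditional on $T_s > t$. 
Let $P_i(t) \triangleq P[X(t) = i]$ for $0 \le i \le 2$. Then the contribution of this transition to 
$P_2(t + \Delta t)$ is 

$$P_1(t)\, P[t < T_s \le t + \Delta t \mid T_s > t],$$ 

where 

$$P[t < T_s \le t + \Delta t \mid T_s > t] = \frac{F_{T_s}(t + \Delta t) - F_{T_s}(t)}{1 - F_{T_s}(t)} = \mu \Delta t + o(\Delta t).$$ 

Using this type of argument for connecting the probability of transitions from states at time 
t to states at time $t + \Delta t$ and ignoring transitions from state 2 to state 0 and vice versa 
enables us to write the state probability at time $t + \Delta t$ in terms of the state probability at 
t in vector matrix form: 

$$\begin{bmatrix} P_0(t + \Delta t) \\ P_1(t + \Delta t) \\ P_2(t + \Delta t) \end{bmatrix} = \begin{bmatrix} 1 - 2\mu\Delta t & \lambda\Delta t & 0 \\ 2\mu\Delta t & 1 - (\lambda + \mu)\Delta t & 2\lambda\Delta t \\ 0 & \mu\Delta t & 1 - 2\lambda\Delta t \end{bmatrix} \begin{bmatrix} P_0(t) \\ P_1(t) \\ P_2(t) \end{bmatrix} + o(\Delta t),$$ 

where $o(\Delta t)$ denotes a quantity of lower order than $\Delta t$. 
Rearranging, we have 

$$\begin{bmatrix} P_0(t + \Delta t) - P_0(t) \\ P_1(t + \Delta t) - P_1(t) \\ P_2(t + \Delta t) - P_2(t) \end{bmatrix} = \begin{bmatrix} -2\mu & \lambda & 0 \\ 2\mu & -(\lambda + \mu) & 2\lambda \\ 0 & \mu & -2\lambda \end{bmatrix} \begin{bmatrix} P_0(t) \\ P_1(t) \\ P_2(t) \end{bmatrix} \Delta t + o(\Delta t).$$ 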

Dividing both sides by $\Delta t$, letting $\Delta t \to 0$, and using an obvious matrix notation, we obtain 

$$\frac{d\mathbf{P}(t)}{dt} = \mathbf{A}\mathbf{P}(t).$$ (9.2-21) 

The matrix A is called the generator of the Markov chain X. This first-order vector differential 
equation can be solved for an initial probability vector, $\mathbf{P}(0) \triangleq \mathbf{P}_0$, using methods of 
linear-system theory [9-3]. The solution is expressed in terms of the matrix exponential 

$$e^{\mathbf{A}t} \triangleq \mathbf{I} + \mathbf{A}t + \frac{1}{2!}(\mathbf{A}t)^2 + \frac{1}{3!}(\mathbf{A}t)^3 + \cdots,$$ 

which converges for all finite t. The solution P(t) is then given as 

$$\mathbf{P}(t) = e^{\mathbf{A}t}\mathbf{P}_0, \quad t \ge 0.$$ 


For details on this method as well as how to obtain an explicit solution, see [9-4]. 

For the present we content ourselves with the steady-state solution obtained by setting 
the time derivative in Equation 9.2-21 to zero, thus yielding AP = 0. From the first and last 
rows we get 

$$-2\mu P_0 + \lambda P_1 = 0$$ 

and 

$$\mu P_1 - 2\lambda P_2 = 0.$$ 

From this we obtain $P_1 = (2\mu/\lambda)P_0$ and $P_2 = (\mu/2\lambda)P_1 = (\mu/\lambda)^2 P_0$. Then invoking 
$P_0 + P_1 + P_2 = 1$, we obtain $P_0 = \lambda^2/(\lambda^2 + 2\mu\lambda + \mu^2)$ and finally 

$$\mathbf{P} = \frac{1}{(\lambda + \mu)^2}\,[\lambda^2,\ 2\mu\lambda,\ \mu^2]^T.$$ 

Thus the steady-state probability of both processors being down is $P_0 = [\lambda/(\lambda + \mu)]^2$. 
Incidentally, if we had used only one processor modeled by a two-state Markov chain, we 
would have obtained $P_0 = \lambda/(\lambda + \mu)$. 
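
The steady-state result of this example can be checked with a few lines of code. The following is a minimal sketch, not the matrix-exponential method of [9-4]: it builds the generator A for assumed rates $\lambda = 1$ and $\mu = 4$ (arbitrary illustrative values), verifies that the closed-form vector satisfies AP = 0, and integrates dP/dt = AP with a simple forward-Euler step (a numerical choice made here) starting from "both processors up".

```python
def generator(lam, mu):
    """Generator matrix A of the two-processor chain (states 0, 1, 2)."""
    return [[-2 * mu, lam, 0.0],
            [2 * mu, -(lam + mu), 2 * lam],
            [0.0, mu, -2 * lam]]

def steady_state(lam, mu):
    """Closed-form steady-state vector [lam^2, 2*mu*lam, mu^2] / (lam + mu)^2."""
    d = (lam + mu) ** 2
    return [lam ** 2 / d, 2 * mu * lam / d, mu ** 2 / d]

def mat_vec(A, p):
    return [sum(a * x for a, x in zip(row, p)) for row in A]

def integrate(A, p0, t_end, dt=1e-3):
    """Forward-Euler integration of dP/dt = A P from t = 0 to t_end."""
    p = list(p0)
    for _ in range(int(round(t_end / dt))):
        dp = mat_vec(A, p)
        p = [x + dt * d for x, d in zip(p, dp)]
    return p

lam, mu = 1.0, 4.0                       # assumed failure and service rates
A = generator(lam, mu)
p_inf = steady_state(lam, mu)
residual = mat_vec(A, p_inf)             # should be (numerically) the zero vector
p_t = integrate(A, [0.0, 0.0, 1.0], t_end=5.0)
print("steady state:", p_inf)
print("P(5)        :", p_t)
```

For these rates the transient dies out within a few time units, so P(5) is already indistinguishable from the steady-state vector.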


Clearly we can generalize this example to any number of states n with independent expo- 
nential interarrival times between these states. In fact, such a process is called a queueing 
process. Other examples are the number of toll booths busy on a superhighway and conges- 
tion states in a computer or telephone network. For more on queueing systems, see [9-2]. 
An important point to notice in the last example is that the exponential transition times 
were crucial in showing the Markov property. In fact, any other distribution but exponential 
would not be memoryless, and the resulting state-transition process would not be a Markov 
chain. 


Birth-Death Markov Chains 


A Markov chain in which transitions are permissible only between adjacent states is called 
a birth-death chain. We first deal with the case where the number of states is infinite and 
afterwards treat the finite-state case. 

Figure 9.2-9 Markov state diagram for the birth-death process showing transition rate parameters. 


1. Infinite-length queues. The state-transition diagram for the infinite-length queue is 
shown in Figure 9.2-9.† In going from state i to state i+1, we say that a birth has occurred. 
Likewise, in going from state i to state i−1 we say a death has occurred. At any time t, $P_j(t)$ 
is the probability of being in state j, that is, of having a “population” of size j, in other 
words the excess of the number of births over deaths. In this model, births are generated by 
a Poisson process. The times between births $\tau_B$ and the times between deaths $\tau_D$ depend 
on the states but obey the exponential distribution with parameters $\lambda_i$ and $\mu_i$, respectively. 
The model is used widely in queueing theory where a birth is an arrival to the queue and a 
death is a departure of one from the queue because of the completion of service. An example 
is people waiting in line to purchase a ticket at a single-server ticket booth. If the theater is 
very large and there are no restrictions on the length of the queue (e.g., the queue may block 
the sidewalk and create a hazard), overflow and saturation can be disregarded. Then the 
dynamics of the queue are described by the basic equation $W_n = \max\{0, W_{n-1} + \tau_s - \tau_i\}$, 
where $W_n$ is the waiting time in the queue for the nth arrival, $\tau_s$ is the service time for 
the (n−1)st arrival, and $\tau_i$ is the interarrival time between the nth and (n−1)st arrivals. 
This is an example of unrestricted queue length. On the other hand, data packets stored in a 
finite-size buffer memory present a different problem. When the buffer is filled (saturation), 
a new arrival must be turned away (in this case we say the data packet is “lost”). 
Following the procedure in Example 9.2-4, we can write that 


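$$\mathbf{P}(t + \Delta t) = \mathbf{B}\mathbf{P}(t),$$ 

where, to first order in $\Delta t$, 

$$\mathbf{B} = \begin{bmatrix}
1 - \lambda_0\Delta t & \mu_1\Delta t & 0 & \cdots & \\
\lambda_0\Delta t & 1 - (\lambda_1 + \mu_1)\Delta t & \mu_2\Delta t & 0 & \cdots \\
0 & \lambda_1\Delta t & 1 - (\lambda_2 + \mu_2)\Delta t & \mu_3\Delta t & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{bmatrix}.$$ 

Rearranging and dividing by $\Delta t$ and letting $\Delta t \to 0$, we get 

$$d\mathbf{P}(t)/dt = \mathbf{A}\mathbf{P}(t),$$ 

where $\mathbf{P}(t) = [P_0(t), P_1(t), \ldots, P_j(t), \ldots]^T$, and A, the generator matrix for the Markov 
chain, is given by 

$$\mathbf{A} = \begin{bmatrix}
-\lambda_0 & \mu_1 & 0 & \cdots & \\
\lambda_0 & -(\lambda_1 + \mu_1) & \mu_2 & 0 & \cdots \\
0 & \lambda_1 & -(\lambda_2 + \mu_2) & \mu_3 & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{bmatrix}.$$ 

†In keeping with standard practice, we draw the diagram showing only the transition rate parameters, 
that is, the $\mu_i$’s and $\lambda_i$’s over the links between states. This type of diagram does not show explicitly, for 
example, that in the Poisson case the short-time probability of staying in state i is $1 - (\lambda_i + \mu_i)\Delta t$. While this 
type of diagram is less clear, it is less crowded than, say, the nonstandard short-time transition probability 
diagram in Figure 9.2-8. 

In the steady state $\mathbf{P}'(t) = 0$. Thus, we obtain from AP = 0, 

$$\begin{aligned}
P_1 &= \rho_1 P_0, \\
P_2 &= \rho_2 P_1 = \rho_1\rho_2 P_0, \\
&\ \ \vdots \\
P_j &= \rho_j P_{j-1} = \rho_j \cdots \rho_2\rho_1 P_0,
\end{aligned}$$ 

where $\rho_j \triangleq \lambda_{j-1}/\mu_j$, for $j \ge 1$. 

Assuming that the series converges, we require that $\sum_{i=0}^{\infty} P_i = 1$. With the notation 
$r_j \triangleq \rho_j \cdots \rho_2\rho_1$ and $r_0 \triangleq 1$, this means $P_0 \sum_{i=0}^{\infty} r_i = 1$ or $P_0 = 1/\sum_{i=0}^{\infty} r_i$. Hence the 
steady-state probabilities for the birth-death Markov chain are given by 

$$P_j = r_j \Big/ \sum_{i=0}^{\infty} r_i, \quad j \ge 0.$$ 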


Failure of the denominator to converge implies that there is no steady state and therefore 
the steady-state probabilities are zero. This model is often called the M/M/1 queue. 


2. M/M/1 Queue with constant birth and death parameters and finite storage L. 
Here we assume that $\lambda_i = \lambda$ and $\mu_i = \mu$, for all i, and that the queue length cannot 
exceed L. This stochastic model can apply to the analysis of a finite buffer as shown in 
Figure 9.2-10. The dynamical equations are 

$$\begin{aligned}
dP_0(t)/dt &= -\lambda P_0(t) + \mu P_1(t), \\
dP_1(t)/dt &= +\lambda P_0(t) - (\lambda + \mu)P_1(t) + \mu P_2(t), \\
&\ \ \vdots \\
dP_L(t)/dt &= +\lambda P_{L-1}(t) - \mu P_L(t).
\end{aligned}$$ 

Note that the first and last equations contain only two terms, since a death cannot occur in 
an empty queue and a birth cannot occur when the queue has its maximum size L. From 
these equations, we easily obtain that the steady-state solution is $P_i = \rho^i P_0$, for $0 \le i \le L$, 
where $\rho \triangleq \lambda/\mu$. From the condition that the buffer must be in some state, we obtain that 
$\sum_{i=0}^{L} \rho^i P_0 = 1$, or that $P_0 = (1 - \rho)/(1 - \rho^{L+1})$ for $\rho \ne 1$. Saturation occurs when the buffer is full. 
The steady-state probability of this event is $P_L = \rho^L(1 - \rho)/(1 - \rho^{L+1})$. Thus for a birth 
rate which is half the death rate, and a buffer of size 10, the probability of saturation is 
approximately $5 \times 10^{-4}$. 

Figure 9.2-10 Illustration of packet arriving at buffer of finite size L. 
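
The saturation probability quoted above is easy to reproduce. The sketch below assumes the product-form steady state of a birth-death chain truncated at L (the helper name and argument layout are illustrative) and compares its last entry with the closed-form expression $\rho^L(1 - \rho)/(1 - \rho^{L+1})$ for $\rho = 0.5$ and L = 10.

```python
def birth_death_steady_state(lams, mus):
    """Steady-state probabilities of a finite birth-death chain.

    lams[j] is the birth rate out of state j (j = 0..L-1) and
    mus[j] is the death rate out of state j+1, so that the ratio
    lams[j] / mus[j] plays the role of rho_{j+1} in the text.
    """
    r = [1.0]                                # r_0 = 1
    for lam, mu in zip(lams, mus):
        r.append(r[-1] * lam / mu)           # r_j = rho_j * r_{j-1}
    total = sum(r)
    return [x / total for x in r]

L, rho = 10, 0.5
p = birth_death_steady_state([rho] * L, [1.0] * L)   # lambda = 0.5, mu = 1
closed_form = rho ** L * (1 - rho) / (1 - rho ** (L + 1))
print(f"P_L from product form = {p[-1]:.4e}")
print(f"P_L from closed form  = {closed_form:.4e}")
```

Both numbers come out near $4.9 \times 10^{-4}$, which rounds to the $5 \times 10^{-4}$ stated in the text. Passing state-dependent rates to the same helper gives the general finite birth-death case.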


Example 9.2-5 
(average queue size) In computer and communication networks, packet switching refers to 
the transmission of blocks of data called packets from node to node. At each node the packets 
are processed with a view toward determining the next link in the source-to-destination 
route. The arrival time of the packets, the amount of time they have to wait in a buffer, 
and the service time in the CPU (the central processing unit) are random variables. 

Assume a first-come, first-served, infinite-capacity buffer, with exponential service time 
with parameter $\mu$, and Poisson-distributed arrivals with a Poisson rate parameter of $\lambda$ 
arrivals per unit time. We know from earlier in this section that the interarrival times of 
the Poisson process are i.i.d. exponential random variables with parameter $\lambda$. The state 
diagram for this case is identical to that of Figure 9.2-9 except that $\mu_1 = \mu_2 = \cdots = \mu$ and 
$\lambda_0 = \lambda_1 = \cdots = \lambda$. Then specializing the results of the previous discussion to this example, 
we find that $P_i = \rho^i P_0$, for $0 \le i$, where $\rho \triangleq \lambda/\mu$ and $P_0 = (1 - \rho)$, provided that $\rho < 1$ 
so that the series converges. Thus, in the steady 
state $P_i = \rho^i(1 - \rho)$, and the average number of packets in the queue, E[N], is computed 
from 

$$E[N] = \sum_{i=0}^{\infty} i P_i = \frac{\rho}{1 - \rho}.$$ 


We leave the details of this elementary calculation as an exercise. 
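
For readers who want to check their work, one possible route to the closed form (assuming $\rho < 1$, and differentiating the geometric series term by term) is sketched here:

```latex
\begin{aligned}
E[N] &= \sum_{i=0}^{\infty} i\,\rho^{i}(1-\rho)
      = (1-\rho)\,\rho \sum_{i=1}^{\infty} i\,\rho^{\,i-1} \\
     &= (1-\rho)\,\rho\,\frac{d}{d\rho}\!\left(\sum_{i=0}^{\infty}\rho^{i}\right)
      = (1-\rho)\,\rho\,\frac{d}{d\rho}\!\left(\frac{1}{1-\rho}\right) \\
     &= \frac{(1-\rho)\,\rho}{(1-\rho)^{2}}
      = \frac{\rho}{1-\rho}.
\end{aligned}
```

For example, $\rho = 0.5$ gives an average of one packet in the queue, and E[N] grows without bound as $\rho$ approaches 1.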


Example 9.2-6 
(finite capacity buffer) We revisit Example 9.2-5 except that now the arriving data packets 
are stored in a buffer of size L. Consider the following set-up: The data stored in the buffer 
are processed by a CPU on a first-come, first-served basis. 
Assume that, say at time t, the buffer is filled to capacity, and that there is a packet being 
processed in the CPU and an arriving packet on the way to the buffer. If the interarrival 
time $\tau_i$ between this packet and the previous one is less than $\tau_s$, the service time for the 
packet in the CPU, the arriving packet will be lost. The probability of this event is 

$$\begin{aligned}
P[\text{``packet loss''}] &= P[\text{``saturation''} \cap \{\tau_s > \tau_i\}] \\
&= \frac{\rho^L(1 - \rho)}{1 - \rho^{L+1}} \times P[\tau_s - \tau_i > 0],
\end{aligned}$$ 

since the events “saturation” and $\{\tau_s > \tau_i\}$ are independent. Since $\tau_s$ and $\tau_i$ are 
independent, the probability $P[\tau_s - \tau_i > 0]$ can easily be computed by convolution. The result 
is $P[\tau_s - \tau_i > 0] = \lambda/(\lambda + \mu)$. The probability of losing the incoming packet is then 

$$P[\text{``packet loss''}] = \frac{\rho^L(1 - \rho)}{1 - \rho^{L+1}} \times \frac{\rho}{1 + \rho},$$ 

which, for $\rho = 0.5$, yields P[“packet loss”] = $1.6 \times 10^{-4}$ for the buffer of size 10, with arrival 
rate equal to half the service rate. 
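
As a numerical cross-check of this example, the sketch below evaluates $P[\tau_s > \tau_i]$ by conditioning on the interarrival time (integrating the exponential density of $\tau_i$ against the survival function of $\tau_s$) instead of the convolution mentioned in the text, and then forms the packet-loss probability. The midpoint rule, the step size, and the rate values $\lambda = 0.5$, $\mu = 1$ are assumptions made for illustration.

```python
import math

def prob_service_exceeds_interarrival(lam, mu, dt=1e-4, t_max=40.0):
    """Midpoint-rule value of the integral over t >= 0 of
    lam*exp(-lam*t) * exp(-mu*t) dt, i.e. the interarrival density
    times P[tau_s > t] for an exponential service time."""
    n = int(round(t_max / dt))
    return dt * sum(lam * math.exp(-(lam + mu) * (k + 0.5) * dt) for k in range(n))

def saturation_probability(rho, L):
    """Steady-state probability that a buffer of size L is full (rho != 1)."""
    return rho ** L * (1 - rho) / (1 - rho ** (L + 1))

lam, mu, L = 0.5, 1.0, 10
rho = lam / mu
p_late = prob_service_exceeds_interarrival(lam, mu)
p_loss = saturation_probability(rho, L) * p_late
print(f"P[tau_s > tau_i] = {p_late:.6f} (closed form {lam / (lam + mu):.6f})")
print(f"P[packet loss]   = {p_loss:.3e}")
```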


Chapman-Kolmogorov Equations 


In the examples of a Markov random sequence in Chapter 8, we specified the transition 
density as a one-step transition, that is, from n— 1 to n. More generally, we can specify 
the transition density from time n to time n+ k, where k > 0, as in the general definition 
of a Markov random sequence. However, in this more general case we must make sure that 
this multistep transition density is consistent, that is, that there exists a one-step density 
that would sequentially yield the same results. This problem is even more important in the 
random process case, where due to continuous time one is always effectively considering 
multistep transition densities; that is, between any two times $t_2 \ne t_1$, there is a time in 
between. 

For example, given a continuous-time transition density $f_X(x_2 | x_1; t_2, t_1)$, how do we 
know that an unconditional pdf $f_X(x; t)$ can be found to satisfy the equation 

$$f_X(x_2; t_2) = \int_{-\infty}^{+\infty} f_X(x_2 | x_1; t_2, t_1)\, f_X(x_1; t_1)\, dx_1$$ 

for all $t_2 > t_1$, and all $x_1$ and $x_2$? 

The Chapman-Kolmogorov equations supply both necessary and sufficient conditions 
for these general transition densities. There is also a version of the Chapman-Kolmogorov 
equations for the discrete-valued case involving PMFs of multistep transitions. 

Consider three times $t_3 > t_2 > t_1$ and the Markov process random variables at these 
three times $X(t_3)$, $X(t_2)$, and $X(t_1)$. We wish to compute the conditional density of $X(t_3)$ 
given $X(t_1)$. First, we write the joint pdf 

$$f_X(x_3, x_1; t_3, t_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2, x_1; t_3, t_2, t_1)\, f_X(x_2, x_1; t_2, t_1)\, dx_2.$$ 

If we now divide both sides of this equation by $f_X(x_1; t_1)$, we obtain 

$$f_X(x_3 | x_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2, x_1)\, f_X(x_2 | x_1)\, dx_2,$$ 

where we have suppressed the times $t_i$ for notational simplicity. Then using the Markov 
property the above becomes 

$$f_X(x_3 | x_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2)\, f_X(x_2 | x_1)\, dx_2,$$ (9.2-22) 

which is known as the Chapman-Kolmogorov equation for the transition density $f_X(x_3 | x_1)$ 
of a Markov process. This equation must hold for all $t_3 > t_2 > t_1$ and for all values of $x_3$ and 
$x_1$. It can be proven that the Chapman-Kolmogorov condition expressed in Equation 9.2-22 
is also sufficient for the existence of the transition density in question [9-5]. 
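
The Chapman-Kolmogorov equation can be illustrated with the Wiener process, whose increments over an interval of length d are Gaussian with zero mean and variance $\alpha d$. The sketch below is a numerical check under that model: it compares the direct transition density from $t_1$ to $t_3$ with the integral over the intermediate value $x_2$, evaluated by a midpoint rule. The grid limits, step count, and sample points are arbitrary choices made here.

```python
import math

def wiener_transition(x_to, x_from, dt, alpha=1.0):
    """Gaussian density of X(t + dt) = x_to given X(t) = x_from."""
    var = alpha * dt
    return math.exp(-(x_to - x_from) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ck_right_side(x3, x1, t1, t2, t3, lo=-12.0, hi=12.0, n=4800):
    """Midpoint-rule integral over x2 of f(x3 | x2) f(x2 | x1)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x2 = lo + (k + 0.5) * h
        total += wiener_transition(x3, x2, t3 - t2) * wiener_transition(x2, x1, t2 - t1)
    return total * h

t1, t2, t3 = 0.0, 0.5, 1.5
x1, x3 = -0.2, 0.7
lhs = wiener_transition(x3, x1, t3 - t1)        # direct transition t1 -> t3
rhs = ck_right_side(x3, x1, t1, t2, t3)         # via the intermediate time t2
print(lhs, rhs)
```

The two values agree to many digits, and moving $t_2$ anywhere strictly between $t_1$ and $t_3$ leaves the integral unchanged, which is what the equation requires.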


Random Process Generated from Random Sequences 


We can obtain a Markov random process as the limit of an infinite number of simulations 
of Markov random sequences. For example, consider the random sequence generated by the 
equation 

$$X[n] = \rho X[n-1] + W[n], \quad -\infty < n < +\infty,$$ 

as given in Example 8.4-6 of Chapter 8, where $|\rho| < 1.0$ to ensure stability. There we found 
that the correlation function of X[n] was 

$$R_{XX}[m] = \sigma_W^2 \rho^{|m|},$$ 

where $\sigma_W^2$ is the variance of the independent random sequence W[n]. Replacing X[n] with 
X(nT), and setting X(t) = X(nT) for $nT \le t < (n+1)T$, we get 

$$R_{XX}(t + \tau, t) = \sigma_W^2 \rho^{|\tau/T|} = \sigma_W^2 \exp(-\alpha|\tau|),$$ 

where $\alpha \triangleq \frac{1}{T}\ln\frac{1}{\rho}$, or alternatively $\rho = \exp(-\alpha T)$. Thus, if we generate a set of simulations 
with $T_k \triangleq T_0/k$ for k = 1, 2, 3, ..., and then for each simulation set $\rho_k \triangleq \sqrt[k]{\exp(-\alpha T_0)}$, we 
will get a set of denser and denser approximations to a limiting random process X(t), that 
is WSS with correlation function 

$$R_{XX}(t + \tau, t) = \sigma_W^2 \exp(-\alpha|\tau|).$$ 
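
The reason for taking the kth root becomes clear from a short calculation. The sketch below (function names and parameter values are illustrative assumptions) computes the one-step coefficient for each sampling period $T_0/k$ and shows that the correlation at a fixed lag $\tau$ is the same exponential for every k, so that refining the time grid does not change the correlation function being approximated.

```python
import math

def rho_k(alpha, T0, k):
    """One-step coefficient for sampling period T0/k: the k-th root of exp(-alpha*T0)."""
    return math.exp(-alpha * T0) ** (1.0 / k)

def lag_correlation(alpha, T0, k, tau, var_w=1.0):
    """var_w * rho_k**|m| at the lag m = tau / (T0/k); tau is assumed
    to be a multiple of the sampling period."""
    m = round(abs(tau) * k / T0)
    return var_w * rho_k(alpha, T0, k) ** m

alpha, T0, tau = 2.0, 0.5, 1.0
target = math.exp(-alpha * abs(tau))
for k in (1, 2, 4, 8):
    print(k, rho_k(alpha, T0, k), lag_correlation(alpha, T0, k, tau), target)
```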


9.3 CONTINUOUS-TIME LINEAR SYSTEMS WITH RANDOM INPUTS 


In this section we look at transformations of stochastic processes. We concentrate on the 
case of linear transformations with memory, since the memoryless case can be handled by 
the transformation of random variables method of Chapter 3. The definition of a linear 
continuous-time system is recalled first. 


Definition 9.3-1 Let $x_1(t)$ and $x_2(t)$ be two deterministic time functions and let $a_1$ 
and $a_2$ be two scalar constants. Let the system be described by the operator equation 
y = L{x}. Then the system is linear if 

$$L\{a_1 x_1(t) + a_2 x_2(t)\} = a_1 L\{x_1(t)\} + a_2 L\{x_2(t)\}$$ (9.3-1) 

for all admissible functions $x_1$ and $x_2$ and all scalars $a_1$ and $a_2$. 


This amounts to saying that the response to a weighted sum of inputs must be the 
weighted sum of the responses to each one individually. Also, in this definition we note 
that the inputs must be in the allowable input space for the system (operator) L. When 
we think of generalizing L to allow a random process input, the most natural choice is to 
input the sample functions of X and find the corresponding sample functions of the output, 
which thereby define a new random process Y. Just as the original random process X is 
a mapping from the sample space to a function space, the linear system in turn maps this 
function space to a new function space. The cascade or composition of the two maps thus 
defines an output random process. This is depicted graphically in Figure 9.3-1. Our goal in 
this section will be to find out how the first- and second-order moments, that is, the mean 
and correlation (and covariance), are transformed by a linear system. 


Theorem 9.3-1 Let the random process X(t) be the input to a linear system L with 
output process Y(t). Then the mean function of the output is given as 


$$E[Y(t)] = L\{E[X(t)]\} = L\{\mu_X(t)\}.$$ (9.3-2) 

Figure 9.3-1 Interpretation of applying a random process to a linear system. (Diagram labels: sample space; input sample function $x_1(t) = X(\zeta_1, t)$; linear system; output sample function $y_1(t) = Y(\zeta_1, t)$.) 

Proof (formal). By definition we have for each sample function 

$$Y(t, \zeta) = L\{X(t, \zeta)\},$$ 

so 

$$E[Y(t)] = E[L\{X(t)\}].$$ 

If we can interchange the two operators, we get the result that the mean function of the 
output is just the result of L operating on the mean function of the input. This can be 
heuristically (formally) justified as follows, if we assume the operator L can be represented 
by the superposition integral with impulse response $h(t, \tau)$: 

$$\begin{aligned}
E[Y(t)] &= E\left[\int_{-\infty}^{+\infty} h(t, \tau)\, X(\tau)\, d\tau\right] \\
&= \int_{-\infty}^{+\infty} h(t, \tau)\, E[X(\tau)]\, d\tau \\
&= \int_{-\infty}^{+\infty} h(t, \tau)\, \mu_X(\tau)\, d\tau \\
&= L\{\mu_X(t)\}. \quad \blacksquare
\end{aligned}$$ 


We present a rigorous proof of this theorem after we study the mean-square stochastic 
integral in Chapter 10. For now, we will assume it is valid, and next look at how the correlation 
function is transformed by a linear system. There are now two stochastic processes 
to consider, the input and the output, and the cross-correlation function $E[X(t_1)Y^*(t_2)]$ 
comes into play. We thus define the cross-correlation function 

$$R_{XY}(t_1, t_2) \triangleq E[X(t_1)Y^*(t_2)].$$ 

From the autocorrelation function of the input $R_{XX}(t_1, t_2)$, we first calculate the 
cross-correlation function $R_{XY}(t_1, t_2)$ and then the autocorrelation function of the output 
$R_{YY}(t_1, t_2)$. If the mean is zero for the input process, then by Theorem 9.3-1 the mean 
of the output process is also zero. Thus the following results can be seen also to hold for 
covariance functions by changing the input to the centered process $X_c(t) \triangleq X(t) - \mu_X(t)$, 
which produces the centered output $Y_c(t) \triangleq Y(t) - \mu_Y(t)$. 


Theorem 9.3-2 Let X(t) and Y(t) be the input and output random processes of the 
linear operator L. Then the following hold: 


$$R_{XY}(t_1, t_2) = L_2^*\{R_{XX}(t_1, t_2)\},$$ (9.3-3) 
$$R_{YY}(t_1, t_2) = L_1\{R_{XY}(t_1, t_2)\},$$ (9.3-4) 

where $L_i$ means the time variable of the operator L is $t_i$. 

Proof (formal). Write 

$$X(t_1)Y^*(t_2) = X(t_1)\, L_2^*\{X^*(t_2)\} = L_2^*\{X(t_1)X^*(t_2)\},$$ 

where we have used the adjoint operator $L^*$ whose impulse response is $h^*(t, \tau)$, that is, the 
complex conjugate of $h(t, \tau)$. Then 

$$\begin{aligned}
E[X(t_1)Y^*(t_2)] &= E[L_2^*\{X(t_1)X^*(t_2)\}] \\
&= L_2^*\{E[X(t_1)X^*(t_2)]\} \quad \text{by interchanging } L_2^* \text{ and } E, \\
&= L_2^*\{R_{XX}(t_1, t_2)\},
\end{aligned}$$ 

which is Equation 9.3-3. Similarly, to prove Equation 9.3-4, we multiply by $Y^*(t_2)$ and get 

$$Y(t_1)Y^*(t_2) = L_1\{X(t_1)Y^*(t_2)\}$$ 

so that 

$$\begin{aligned}
E[Y(t_1)Y^*(t_2)] &= E[L_1\{X(t_1)Y^*(t_2)\}] \\
&= L_1\{E[X(t_1)Y^*(t_2)]\} \quad \text{by interchanging } L_1 \text{ and } E, \\
&= L_1\{R_{XY}(t_1, t_2)\},
\end{aligned}$$ 

which is Equation 9.3-4. If we combine Equation 9.3-3 and Equation 9.3-4, we get 

$$R_{YY}(t_1, t_2) = L_1 L_2^*\{R_{XX}(t_1, t_2)\}.$$ (9.3-5) 


Example 9.3-1 
(edge or “change” detector) Let X(t) be a real-valued random process, modeling a certain 
sensor signal, and define $Y(t) \triangleq L\{X(t)\} \triangleq X(t) - X(t-1)$ so 

$$E[Y(t)] = L\{\mu_X(t)\} = \mu_X(t) - \mu_X(t-1).$$ 

Also 

$$R_{XY}(t_1, t_2) = L_2\{R_{XX}(t_1, t_2)\} = R_{XX}(t_1, t_2) - R_{XX}(t_1, t_2 - 1)$$ 

and 

$$\begin{aligned}
R_{YY}(t_1, t_2) &= L_1\{R_{XY}(t_1, t_2)\} = R_{XY}(t_1, t_2) - R_{XY}(t_1 - 1, t_2) \\
&= R_{XX}(t_1, t_2) - R_{XX}(t_1 - 1, t_2) - R_{XX}(t_1, t_2 - 1) + R_{XX}(t_1 - 1, t_2 - 1).
\end{aligned}$$ 

To be specific, if we take $\mu_X(t) = 0$ and 

$$R_{XX}(t_1, t_2) \triangleq \sigma_X^2 \exp(-\alpha|t_1 - t_2|),$$ 

then 

$$E[Y(t)] = 0 \quad \text{since } \mu_X = 0,$$ 

and 

$$R_{YY}(t_1, t_2) = \sigma_X^2\left(2\exp(-\alpha|t_1 - t_2|) - \exp(-\alpha|t_1 - t_2 - 1|) - \exp(-\alpha|t_1 - t_2 + 1|)\right).$$ 

Figure 9.3-2 Input correlation function $R_{XX}$ of Example 9.3-1 versus $\tau = t_1 - t_2$. 

We note that both $R_{XX}$ and $R_{YY}$ are functions only of the difference of the two 
observation times $t_1$ and $t_2$. The input correlation function $R_{XX}$ is plotted in Figure 9.3-2, 
for $\alpha = 2$ and $\sigma_X^2 = 2$. Note the negative correlation values in output correlation function 
$R_{YY}$, shown in Figure 9.3-3, introduced by the difference operation of the edge detector. 
The variance of Y(t) is constant and is given as 

$$\sigma_Y^2(t) = \sigma_Y^2 = 2\sigma_X^2[1 - \exp(-\alpha)].$$ 

We see that as $\alpha$ tends to zero, the variance of Y goes to zero. This is because as $\alpha$ tends 
to zero, X(t) and X(t−1) become very positively correlated, and hence there is very little 
power in their difference. 

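
The closed-form output correlation of this example is simple to evaluate. The following sketch uses the parameter values $\alpha = 2$ and $\sigma_X^2 = 2$ quoted for the figures (the function names are illustrative) and reproduces the three observations made above: the variance formula, the negative correlation near a lag of one, and the vanishing variance as $\alpha$ tends to zero.

```python
import math

def r_xx(tau, var_x=2.0, alpha=2.0):
    """Input correlation sigma_X^2 * exp(-alpha*|tau|)."""
    return var_x * math.exp(-alpha * abs(tau))

def r_yy(tau, var_x=2.0, alpha=2.0):
    """Output correlation of Y(t) = X(t) - X(t-1) as a function of tau = t1 - t2."""
    return (2 * r_xx(tau, var_x, alpha)
            - r_xx(tau - 1, var_x, alpha)
            - r_xx(tau + 1, var_x, alpha))

variance = r_yy(0.0)
print("variance of Y      :", variance)
print("2*var_x*(1-e^-a)   :", 2 * 2.0 * (1 - math.exp(-2.0)))
print("R_YY at tau = 1    :", r_yy(1.0))
print("variance, alpha->0 :", r_yy(0.0, alpha=1e-6))
```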
Example 9.3-2 


(derivative process) Let X(t) be a real-valued random process with constant mean function 
$\mu_X(t) = \mu$ and covariance function 

$$K_{XX}(t, s) = \sigma^2 \cos\omega_0(t - s).$$ 

We wish to determine the mean and covariance function of the derivative process X'(t). 
Here the linear operator is d(·)/dt. First we determine the mean, 

$$\mu_{X'}(t) = E[X'(t)] = \frac{d}{dt}E[X(t)] = \frac{d}{dt}\mu_X(t) = \frac{d\mu}{dt} = 0.$$ 

Figure 9.3-3 Output correlation function $R_{YY}$ of Example 9.3-1 versus $\tau = t_1 - t_2$. 

Now, for this real-valued random process, the covariance function of X'(t) is 

$$K_{X'X'}(t_1, t_2) = E[X'(t_1)X'(t_2)],$$ 

since $\mu_{X'}(t) = 0$. Thus by Equation 9.3-5, with X'(t) = Y(t), 

$$\begin{aligned}
K_{X'X'}(t_1, t_2) &= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2} K_{XX}(t_1, t_2)\right) = \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2}\, \sigma^2 \cos\omega_0(t_1 - t_2)\right) \\
&= \frac{\partial}{\partial t_1}\left(\omega_0 \sigma^2 \sin\omega_0(t_1 - t_2)\right) = (\omega_0\sigma)^2 \cos\omega_0(t_1 - t_2).
\end{aligned}$$ 

We note that the result is just the original covariance function scaled up by the factor $\omega_0^2$. 
This similarity in form happened because the given $K_{XX}(t, s)$ is the covariance function 
of a sine wave with random amplitude and phase (cf. Example 9.1-5). Since the phase is 
random, the sine and its derivative the cosine are indistinguishable by shape. 
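
The mixed partial derivative in this example can be confirmed numerically. The sketch below approximates $\partial^2 K_{XX}/\partial t_1 \partial t_2$ by a central difference (a numerical shortcut chosen here, with arbitrary values $\sigma = 1.5$ and $\omega_0 = 3$) and compares it with $(\omega_0\sigma)^2 \cos\omega_0(t_1 - t_2)$.

```python
import math

SIGMA, W0 = 1.5, 3.0      # assumed illustrative values of sigma and omega_0

def k_xx(t1, t2):
    """Covariance sigma^2 * cos(omega_0 * (t1 - t2)) of the input process."""
    return SIGMA ** 2 * math.cos(W0 * (t1 - t2))

def mixed_partial(f, t1, t2, h=1e-3):
    """Central-difference estimate of the mixed derivative d^2 f / (dt1 dt2)."""
    return (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
            - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4 * h * h)

t1, t2 = 0.8, 0.3
numeric = mixed_partial(k_xx, t1, t2)
closed_form = (W0 * SIGMA) ** 2 * math.cos(W0 * (t1 - t2))
print(numeric, closed_form)
```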


White Noise 


Let the random process under consideration be the Wiener process of Section 9.2. Here 
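we consider the derivative of this process. For any $\alpha > 0$, the covariance function of 
the Wiener process is $K_{XX}(t_1, t_2) = \alpha\min(t_1, t_2)$ and its mean function $\mu_X = 0$. Let 
W(t) = dX(t)/dt. Then proceeding as in the above example, we can calculate $\mu_W(t) = 
E[dX(t)/dt] = d\mu_X(t)/dt = 0$. For the covariance, 

$$\begin{aligned}
K_{WW}(t_1, t_2) &= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2} K_{XX}(t_1, t_2)\right) \\
&= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2} \begin{cases} \alpha t_2, & t_2 < t_1 \\ \alpha t_1, & t_2 \ge t_1 \end{cases}\right) \\
&= \frac{\partial}{\partial t_1} \begin{cases} \alpha, & t_2 < t_1 \\ 0, & t_2 > t_1 \end{cases} \\
&= \frac{\partial}{\partial t_1}\, \alpha u(t_1 - t_2) \\
&= \alpha\delta(t_1 - t_2),
\end{aligned}$$ 

where u(·) denotes the unit-step function. 

Thus the covariance function of white noise is the impulse function. Since white noise 
always has zero mean, the correlation function too is an impulse. It is common to see 

$$R_{WW}(t_1, t_2) = \sigma^2\delta(t_1 - t_2) = K_{WW}(t_1, t_2),$$ (9.3-6) 

with $\alpha$ replaced by $\sigma^2$, but one should note that the power in this process $E[|W(t)|^2] = 
\sigma^2\delta(0) = \infty$, not $\sigma^2$. In fact, $\sigma^2$ is a power density for the white noise process. 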
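
One way to make the impulse plausible is to replace the derivative by a difference quotient. The sketch below is an illustration under that assumption, not part of the text's derivation: it takes $W_h(t) = [X(t+h) - X(t)]/h$, expands its covariance in terms of $\alpha\min(t_1, t_2)$, and shows that as h shrinks the peak value $\alpha/h$ grows without bound while the area under the covariance stays equal to $\alpha$.

```python
def k_xx(t1, t2, alpha=1.0):
    """Wiener-process covariance alpha * min(t1, t2)."""
    return alpha * min(t1, t2)

def k_diff_quotient(t1, t2, h, alpha=1.0):
    """Covariance of W_h(t) = (X(t + h) - X(t)) / h, expanded by bilinearity."""
    return (k_xx(t1 + h, t2 + h, alpha) - k_xx(t1 + h, t2, alpha)
            - k_xx(t1, t2 + h, alpha) + k_xx(t1, t2, alpha)) / (h * h)

def area(h, t2=1.0, alpha=1.0, n=2000):
    """Midpoint-rule integral of the covariance over t1 in [t2 - 2h, t2 + 2h]."""
    dt = 4 * h / n
    return dt * sum(k_diff_quotient(t2 - 2 * h + (k + 0.5) * dt, t2, h, alpha)
                    for k in range(n))

for h in (0.1, 0.01):
    print(f"h = {h}: peak = {k_diff_quotient(1.0, 1.0, h):.3f}, area = {area(h):.6f}")
```

The covariance of $W_h$ is a triangle of height $\alpha/h$ and base 2h, which behaves like $\alpha\delta(t_1 - t_2)$ in the limit.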

Note that the sample functions are highly discontinuous and the white noise process is 
not separable.† 


9.4 SOME USEFUL CLASSIFICATIONS OF RANDOM PROCESSES 


Here we look at several classes of random processes and pairs of processes. These classifica- 
tions also apply to the random sequences studied earlier. 


Definition 9.4-1 Let X and Y be random processes. They are 


(a) Uncorrelated if $R_{XY}(t_1, t_2) = \mu_X(t_1)\mu_Y^*(t_2)$, for all $t_1$ and $t_2$; 

(b) Orthogonal if $R_{XY}(t_1, t_2) = 0$ for all $t_1$ and $t_2$; 

(c) Independent if for all positive integers n, the nth-order CDF of X and Y factors, 
that is, 

$$F_{XY}(x_1, y_1, x_2, y_2, \ldots, x_n, y_n; t_1, \ldots, t_n) = F_X(x_1, \ldots, x_n; t_1, \ldots, t_n)\, F_Y(y_1, \ldots, y_n; t_1, \ldots, t_n),$$ 

for all $x_i$, $y_i$ and for all $t_1, \ldots, t_n$. 

†The idea of separability (cf. Section 9.1) is to make a countable set of points on the t-axis (e.g., time 
axis) determine the properties of the process. In effect it says that knowing the pdf over a countable set of 
points implies knowing the pdf everywhere. See [9-6]. 

Note that two random processes are orthogonal if they are uncorrelated and at least one 
of their mean functions is zero. Actually, the orthogonality concept is useful only when the 
random processes under consideration are zero-mean, in which case it becomes equivalent to 
the uncorrelated condition. The orthogonality concept was introduced for random vectors 
in Chapter 5. This concept will prove useful for estimating random processes and sequences 
in Chapter 11. 

A random process may be uncorrelated, orthogonal, or independent of itself at earlier 
and/or later times. For example, we may have $R_{XX}(t_1, t_2) = 0$ for all $t_1 \ne t_2$, in which 
case we call X an orthogonal random process. Similarly X(t) may be independent of 
$\{X(t_1), \ldots, X(t_n)\}$ for all $t \notin \{t_1, \ldots, t_n\}$ and for all $t_1, \ldots, t_n$, and for all $n \ge 1$. Then we 
say X(t) is an independent random process. Clearly, the sample functions of such processes 
will be quite rough, since arbitrarily small changes in t yield complete independence. 


Stationarity 


We say a random process is stationary when its statistics do not change with the continuous 
parameter, often time. The formal definition is: 


Definition 9.4-2 A random process X(t) is stationary if it has the same nth-order 
CDF as X(t + T), that is, the two n-dimensional functions 

$$F_X(x_1, \ldots, x_n; t_1, \ldots, t_n) = F_X(x_1, \ldots, x_n; t_1 + T, \ldots, t_n + T)$$ 

are identically equal for all T, for all positive integers n, and for all $t_1, \ldots, t_n$. 
When the CDF is differentiable, we can equivalently write this in terms of the pdf as 

$$f_X(x_1, \ldots, x_n; t_1, \ldots, t_n) = f_X(x_1, \ldots, x_n; t_1 + T, \ldots, t_n + T),$$ 

and this is the form of the stationarity condition that is most often used. This definition 
implies that the mean of a stationary process is a constant. To prove this note that $f(x; t) = 
f(x; t + T)$ for all T implies $f(x; t) = f(x; 0)$ by taking T = −t, which in turn implies that 
$E[X(t)] = \mu_X(t) = \mu_X(0)$, a constant. 

Since the second-order density is also shift invariant, that is, 

$$f(x_1, x_2; t_1, t_2) = f(x_1, x_2; t_1 + T, t_2 + T),$$ 

we have, on choosing $T = -t_2$, that 

$$f(x_1, x_2; t_1, t_2) = f(x_1, x_2; t_1 - t_2, 0),$$ 

which implies $E[X(t_1)X^*(t_2)] = R_{XX}(t_1 - t_2, 0)$. In the stationary case, therefore, the 
notation for correlation function can be simplified to a function of just the shift $\tau \triangleq t_1 - t_2$ 
between the two sampling instants or parameters. Thus we can define the one-parameter 
correlation function 

$$R_{XX}(\tau) \triangleq R_{XX}(\tau, 0) = E[X(t + \tau)X^*(t)],$$ (9.4-1) 

which is functionally independent of the parameter t. Examples of this sort of correlation 
function were seen in Section 9.3. 

A weaker form of stationarity exists which does not directly constrain the nth-order 
CDFs, but rather just the first- and second-order moments. This property, which is easier 
to check, is called wide-sense stationarity and will be quite useful in what follows. 


Definition 9.4-3 A random process X is wide-sense stationary (WSS) if $E[X(t)] = \mu_X$,
a constant, and $E[X(t+\tau)X^*(t)] = R_{XX}(\tau)$ for all $-\infty < \tau < +\infty$, independent of the
time parameter t.


Example 9.4-1 


(WSS complex exponential) Let $X(t) \triangleq A\exp(j2\pi ft)$ with f a known real constant and
A a real-valued random variable with mean $E[A] = 0$ and finite average power $E[A^2]$.
Calculating the mean and correlation of X(t), we obtain

$$E[X(t)] = E[A\exp(j2\pi ft)] = E[A]\exp(j2\pi ft) = 0,$$

and

$$E[X(t+\tau)X^*(t)] = E[A\exp(j2\pi f(t+\tau))\,A\exp(-j2\pi ft)] = E[A^2]\exp(j2\pi f\tau) = R_{XX}(\tau).$$


Note that E[A] = 0 is a necessary condition for WSS here. Question: Would this work with 
a cosine function in place of the complex exponential? 


The process in Example 9.4-1, while shown to be wide-sense stationary, is clearly not 
stationary. Consider, for example, that X(0) must be pure real while X(1/(4f)) must always 
be pure imaginary. We thus conclude that the WSS property is considerably weaker than 
stationarity. 
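The WSS claim in Example 9.4-1, and the question about replacing the complex exponential by a cosine, can be explored numerically. The sketch below is only an illustration under assumptions that are not in the text: A is taken to be equiprobable on {-1, +1} (so E[A] = 0 and E[A^2] = 1) and f = 2 Hz. With this two-point law the expectations are computed exactly by enumeration rather than by simulation.

```python
import cmath
import math

F = 2.0                  # assumed frequency f (hypothetical value)
A_VALUES = (-1.0, 1.0)   # assumed two-point law for A: E[A] = 0, E[A^2] = 1


def expect(g):
    """Exact expectation of g(A) under the equiprobable two-point law."""
    return sum(g(a) for a in A_VALUES) / len(A_VALUES)


def corr_exp(t, tau):
    """E[X(t+tau) X*(t)] for X(t) = A exp(j 2 pi f t)."""
    x = lambda a, s: a * cmath.exp(2j * math.pi * F * s)
    return expect(lambda a: x(a, t + tau) * x(a, t).conjugate())


def corr_cos(t, tau):
    """E[X(t+tau) X(t)] for the real process X(t) = A cos(2 pi f t)."""
    x = lambda a, s: a * math.cos(2 * math.pi * F * s)
    return expect(lambda a: x(a, t + tau) * x(a, t))


tau = 0.1
exp_vals = [corr_exp(t, tau) for t in (0.0, 0.05, 0.3)]
cos_vals = [corr_cos(t, tau) for t in (0.0, 0.05, 0.3)]
print(exp_vals)  # the same complex value for every t
print(cos_vals)  # the value changes with t
```

For the complex exponential the correlation depends on the shift only, while for the cosine with a single random amplitude it varies with t, which suggests that the cosine version is not WSS under these assumptions.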

We can generalize this example to have M complex sinusoids and obtain a rudimentary 
frequency domain representation for zero-mean WSS random processes. Consider 


$$X(t) = \sum_{k=1}^{M} A_k\exp(j2\pi f_k t),$$

where the generally complex random variables $A_k$ are uncorrelated with mean zero and
variances $\sigma_k^2$. Then the resulting random process is WSS with mean zero and autocorrelation
(or autocovariance) equal to

$$R_{XX}(\tau) = \sum_{k=1}^{M}\sigma_k^2\exp(j2\pi f_k\tau). \tag{9.4-2}$$
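The step from the sum of complex sinusoids to Equation 9.4-2 can be written out as follows. This is a sketch of the intermediate algebra, using only the stated assumptions that the $A_k$ have zero mean and are uncorrelated, so that $E[A_kA_l^*] = \sigma_k^2\delta_{kl}$ (the Kronecker delta $\delta_{kl}$ is notation introduced here).

```latex
\begin{align*}
E[X(t+\tau)X^*(t)]
  &= \sum_{k=1}^{M}\sum_{l=1}^{M} E[A_k A_l^*]\,
     \exp\bigl(j2\pi f_k (t+\tau)\bigr)\exp\bigl(-j2\pi f_l t\bigr) \\
  &= \sum_{k=1}^{M}\sum_{l=1}^{M} \sigma_k^2\,\delta_{kl}\,
     \exp\bigl(j2\pi (f_k - f_l) t\bigr)\exp\bigl(j2\pi f_k \tau\bigr) \\
  &= \sum_{k=1}^{M} \sigma_k^2 \exp\bigl(j2\pi f_k \tau\bigr).
\end{align*}
```

The cross terms, which are the only ones that depend on t, vanish because the coefficients are uncorrelated.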


For such random processes X(t), the set of random coefficients $\{A_k\}$ constitutes a frequency
domain representation. From our experience with Fourier analysis of deterministic functions,
we can expect that as M becomes large and the $f_k$ become dense, that is, as the spacing
between the $f_k$ becomes small and they cover the frequency range of interest, most random
processes will have such an approximate representation. Such is the case (cf. Section 10.6).

9.5 WIDE-SENSE STATIONARY PROCESSES AND LSI SYSTEMS


In this section we treat random processes that are jointly stationary and of second order,
that is,

$$E[|X(t)|^2] < \infty.$$


Some important properties of the auto- and cross-correlation functions of stationary second- 
order processes are summarized as follows. They, of course, also hold for the respective 
covariance functions. 


1. $|R_{XX}(\tau)| \leq R_{XX}(0)$, which, for the real case, directly follows from
$E[|X(t+\tau) - X(t)|^2] \geq 0$.

2. $|R_{XY}(\tau)| \leq \sqrt{R_{XX}(0)R_{YY}(0)}$, which is derived using the Schwarz inequality (cf.
Section 4.3; also called diagonal dominance). It also proves the complex case of 1.

3. $R_{XX}(\tau) = R_{XX}^*(-\tau)$, since $E[X(t+\tau)X^*(t)] = E[X(t)X^*(t-\tau)] =
E^*[X(t-\tau)X^*(t)]$ for WSS random processes, which is called the conjugate symmetry
property. In the special case of a real-valued process, this property becomes that of
even symmetry, that is,

$$R_{XX}(\tau) = R_{XX}(-\tau). \tag{3a}$$

4. Another important property of the autocorrelation function of a complex-valued,
stationary random process is that it must be positive semidefinite, that is,
for all $N > 0$, all $t_1 < t_2 < \cdots < t_N$, and all complex $a_1, a_2, \ldots, a_N$,

$$\sum_{k=1}^{N}\sum_{l=1}^{N} a_k a_l^* R_{XX}(t_k - t_l) \geq 0.$$

This was shown in Section 9.1 to be a necessary condition for a given function
$g(t,s) = g(t-s)$ to be an autocorrelation function. We will show that this property
is also a sufficient condition, so that positive semidefiniteness actually characterizes
autocorrelation functions. In general, however, it is very difficult to check
property (4) directly.
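The quadratic form in the positive semidefinite property can be evaluated directly for any one choice of sampling times and coefficients. The sketch below does this for two functions chosen here purely for illustration: the exponential $\exp(-|\tau|)$ and a rectangular candidate that equals 1 for $|\tau| < 1.5$. A single nonnegative value proves nothing, which is why the direct check is hard in general, but a single negative value is enough to rule a candidate out.

```python
import math


def quad_form(R, times, coeffs):
    """Return sum_k sum_l a_k a_l^* R(t_k - t_l) for the given samples."""
    total = 0j
    for tk, ak in zip(times, coeffs):
        for tl, al in zip(times, coeffs):
            total += ak * al.conjugate() * R(tk - tl)
    return total


R_exp = lambda tau: math.exp(-abs(tau))              # illustrative valid choice
R_box = lambda tau: 1.0 if abs(tau) < 1.5 else 0.0   # illustrative candidate

times = [0.0, 1.0, 2.0]                 # hypothetical sampling instants
coeffs = [1 + 0j, -1 + 0j, 1 + 0j]      # hypothetical coefficients a_1..a_3

q_exp = quad_form(R_exp, times, coeffs)
q_box = quad_form(R_box, times, coeffs)
print(q_exp.real, q_box.real)
```

With these particular samples the rectangular candidate produces a negative quadratic form, so it cannot be an autocorrelation function.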


To start off, we can specialize the results of Theorems 9.3-1 and 9.3-2, which were derived 
for the general case, to LSI systems. Rewriting Equation 9.3-2 we have 


$$E[Y(t)] = L\{\mu_X(t)\} = \int_{-\infty}^{+\infty}\mu_X(\tau)h(t-\tau)\,d\tau = \mu_X(t) * h(t).$$


Using Theorem 9.3-2 and Equations 9.3-3 and 9.3-4, we get also 


$$R_{XY}(t_1,t_2) = \int_{-\infty}^{+\infty} h^*(\tau_2)R_{XX}(t_1,t_2-\tau_2)\,d\tau_2,$$

and

$$R_{YY}(t_1,t_2) = \int_{-\infty}^{+\infty} h(\tau_1)R_{XY}(t_1-\tau_1,t_2)\,d\tau_1,$$

which can be written in convolution operator notation as

$$R_{XY}(t_1,t_2) = h^*(t_2) * R_{XX}(t_1,t_2),$$

where the convolution is along the $t_2$-axis, and

$$R_{YY}(t_1,t_2) = h(t_1) * R_{XY}(t_1,t_2),$$

where the convolution is along the $t_1$-axis. Combining these two equations, we get

$$R_{YY}(t_1,t_2) = h(t_1) * R_{XX}(t_1,t_2) * h^*(t_2).$$


Wide-Sense Stationary Case

If we input the stationary random process X(t) to an LSI system with impulse response 


h(t), then the output random process can be expressed as the convolution integral, 


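$$Y(t) = \int_{-\infty}^{+\infty} h(\tau)X(t-\tau)\,d\tau, \tag{9.5-1}$$

when this integral exists. Computing the mean of the output process Y(t), we get

$$E[Y(t)] = \int_{-\infty}^{+\infty} h(\tau)E[X(t-\tau)]\,d\tau \quad \text{by Theorem 9.3-1},$$

$$= \int_{-\infty}^{+\infty} h(\tau)\mu_X\,d\tau = \mu_X\int_{-\infty}^{+\infty} h(\tau)\,d\tau \tag{9.5-2}$$

$$= \mu_X H(0),$$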


where H(w) is the system’s frequency response. 

We thus see that the mean of the output is constant and equals the mean of the input 
times the system function evaluated at w = 0, the so-called “dc gain” of the system. If we 
compute the cross-correlation function between the input process and the output process, 
we find that 


$$R_{YX}(\tau) = E[Y(t+\tau)X^*(t)]$$

$$= E[Y(t)X^*(t-\tau)] \quad \text{by substituting } t-\tau \text{ for } t,$$

$$= \int_{-\infty}^{+\infty} h(\alpha)E[X(t-\alpha)X^*(t-\tau)]\,d\alpha,$$

and bringing the operator E inside the integral by Theorem 9.3-2,

$$= \int_{-\infty}^{+\infty} h(\alpha)R_{XX}(\tau-\alpha)\,d\alpha,$$

which can be rewritten as

$$R_{YX}(\tau) = h(\tau) * R_{XX}(\tau). \tag{9.5-3}$$

Thus, the cross-correlation $R_{YX}$ equals h convolved with the autocorrelation $R_{XX}$. This
fact can be used to identify unknown systems (see Problem 9.28).
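The output autocorrelation function $R_{YY}(\tau)$ can now be obtained from $R_{YX}(\tau)$ as
follows:

$$R_{YY}(\tau) = E[Y(t+\tau)Y^*(t)] = \int_{-\infty}^{+\infty} h^*(\alpha)E[Y(t+\tau)X^*(t-\alpha)]\,d\alpha$$

$$= \int_{-\infty}^{+\infty} h^*(\alpha)R_{YX}(\tau+\alpha)\,d\alpha = R_{YX}(\tau) * h^*(-\tau).$$

Combining both equations, we get

$$R_{YY}(\tau) = h(\tau) * h^*(-\tau) * R_{XX}(\tau). \tag{9.5-4}$$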


We observe that when $R_{XX}(\tau) = \delta(\tau)$, then the output correlation function is
$R_{YY}(\tau) = h(\tau) * h^*(-\tau)$, which is sometimes called the autocorrelation impulse response
(AIR), denoted as $g(\tau) = h(\tau) * h^*(-\tau)$. Note that $g(\tau)$ must be positive semidefinite, and
indeed $FT\{g(\tau)\} = |H(\omega)|^2 \geq 0$.
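The relation between the AIR and $|H(\omega)|^2$ has a direct discrete-time analog that is easy to check by machine. The sketch below is a discrete-time illustration only (the text treats continuous time); the three-tap impulse response `h` and the helper names are hypothetical.

```python
import cmath

h = [1.0, 0.5, -0.25]   # hypothetical FIR impulse response h[0], h[1], h[2]


def air(h):
    """Discrete AIR g[k] = sum_n h[n+k] h*[n], for k = -(L-1), ..., L-1."""
    L = len(h)
    return {k: sum(h[n + k] * complex(h[n]).conjugate()
                   for n in range(L) if 0 <= n + k < L)
            for k in range(-(L - 1), L)}


def dtft(seq, w):
    """Discrete-time Fourier transform of a sequence stored as {index: value}."""
    return sum(v * cmath.exp(-1j * w * k) for k, v in seq.items())


g = air(h)
H = lambda w: sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(h))

# Compare the transform of g with |H|^2 at a few frequencies.
checks = [(dtft(g, w), abs(H(w)) ** 2) for w in (0.0, 0.7, 2.0, 3.1)]
for G, P in checks:
    print(G.real, P)
```

At each test frequency the transform of g is real, nonnegative, and equal to the squared magnitude of the frequency response.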
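Similarly, we also find (proof left as an exercise for the reader)

$$R_{XY}(\tau) = \int_{-\infty}^{+\infty} h^*(-\alpha)R_{XX}(\tau-\alpha)\,d\alpha = h^*(-\tau) * R_{XX}(\tau), \tag{9.5-5a}$$

and

$$R_{YY}(\tau) = g(\tau) * R_{XX}(\tau).$$

This elegant and concise notation is shorthand for

$$R_{YY}(\tau) = \int_{-\infty}^{+\infty} g(\tau')R_{XX}(\tau-\tau')\,d\tau' \quad \text{(a convolution)} \tag{9.5-5b}$$

$$g(\tau') = \int_{-\infty}^{+\infty} h^*(\alpha)h(\alpha+\tau')\,d\alpha. \quad \text{(a correlation product)} \tag{9.5-5c}$$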


Example 9.5-1 
(derivative of WSS process) Let the second-order random process X(t) be stationary with
one-parameter correlation function $R_{XX}(\tau)$ and constant mean function $\mu_X(t) = \mu_X$. Consider
the system consisting of a derivative operator, that is,

$$Y(t) = \frac{dX(t)}{dt}.$$

Using the above equations, we find $\mu_Y(t) = d\mu_X(t)/dt = 0$ and cross-correlation function

$$R_{XY}(\tau) = u_1^*(-\tau) * R_{XX}(\tau) = -\frac{dR_{XX}(\tau)}{d\tau},$$

since the impulse response of the derivative operator is $h(t) = d\delta(t)/dt = u_1(t)$, the (formal)
derivative of the impulse $\delta(t)$, sometimes called the unit doublet.† Then

$$R_{YY}(\tau) = u_1(\tau) * R_{XY}(\tau) = \frac{dR_{XY}(\tau)}{d\tau} = -\frac{d^2R_{XX}(\tau)}{d\tau^2}.$$

Notice the AIR function here is $g(\tau) = -u_2(\tau)$, minus the second (formal) derivative of
$\delta(\tau)$.
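The result $R_{YY}(\tau) = -d^2R_{XX}(\tau)/d\tau^2$ can be tried on a concrete case. The sketch below assumes, purely for illustration, the smooth autocorrelation $R_{XX}(\tau) = \exp(-\tau^2)$, for which the closed form is $(2 - 4\tau^2)\exp(-\tau^2)$, and compares it with a central finite-difference approximation.

```python
import math

# Assumed twice-differentiable autocorrelation (not taken from the text).
R_xx = lambda tau: math.exp(-tau * tau)


def second_diff(f, x, h=1e-4):
    """Central finite-difference approximation of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


R_yy_numeric = lambda tau: -second_diff(R_xx, tau)
R_yy_exact = lambda tau: (2.0 - 4.0 * tau * tau) * math.exp(-tau * tau)

for tau in (0.0, 0.5, 1.3):
    print(tau, R_yy_numeric(tau), R_yy_exact(tau))
```

Note that $R_{YY}(0) = 2 > 0$ here, as an average power must be, because $R_{XX}$ has a maximum (negative curvature) at the origin.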


Power Spectral Density 


For WSS, and hence for stationary processes, we can define a useful density for average 
power versus frequency, called the power spectral density (psd). 


Definition 9.5-1 Let $R_{XX}(\tau)$ be an autocorrelation function. Then we define the
power spectral density $S_{XX}(\omega)$ to be its Fourier transform (if it exists), that is,

$$S_{XX}(\omega) \triangleq \int_{-\infty}^{+\infty} R_{XX}(\tau)e^{-j\omega\tau}\,d\tau. \tag{9.5-6}$$

Under quite general conditions one can define the inverse Fourier transform, which
equals $R_{XX}(\tau)$ at all points of continuity,

$$R_{XX}(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S_{XX}(\omega)e^{+j\omega\tau}\,d\omega. \tag{9.5-7}$$


†In this u-function notation, $u_{-1}(t) = u(t)$, the unit step function, and $u_0(t) = \delta(t)$, the unit impulse
[9-9].




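Table 9.5-1 Correlation Function Properties of Corresponding Power Spectral Densities

Random Process | Correlation Function | Power Spectral Density
--- | --- | ---
$X(t)$ | $R_{XX}(\tau)$ | $S_{XX}(\omega)$
$aX(t)$ | $|a|^2R_{XX}(\tau)$ | $|a|^2S_{XX}(\omega)$
$X_1(t) + X_2(t)$ with $X_1$ and $X_2$ orthogonal | $R_{X_1X_1}(\tau) + R_{X_2X_2}(\tau)$ | $S_{X_1X_1}(\omega) + S_{X_2X_2}(\omega)$
$X'(t)$ | $-d^2R_{XX}(\tau)/d\tau^2$ | $\omega^2S_{XX}(\omega)$
$X^{(n)}(t)$ | $(-1)^n d^{2n}R_{XX}(\tau)/d\tau^{2n}$ | $\omega^{2n}S_{XX}(\omega)$
$X(t)\exp(j\omega_0t)$ | $\exp(j\omega_0\tau)R_{XX}(\tau)$ | $S_{XX}(\omega-\omega_0)$
$X(t)\cos(\omega_0t+\Theta)$ with independent $\Theta$ uniform on $[-\pi,+\pi]$ | $\frac{1}{2}R_{XX}(\tau)\cos(\omega_0\tau)$ | $\frac{1}{4}[S_{XX}(\omega+\omega_0) + S_{XX}(\omega-\omega_0)]$
$X(t) + b$ $(E[X(t)] = 0)$ | $R_{XX}(\tau) + |b|^2$ | $S_{XX}(\omega) + 2\pi|b|^2\delta(\omega)$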


In operator notation we have, 
Sxx = FT{Rxx} 


and 
Rxx =IFT{Sxx}, 


where FT and IFT stand for the respective Fourier operators. 

The name power spectral density (psd) will be justified later. All that we have done 
thus far is define it as the Fourier transform of Rx x(r). We can also define the Fourier 
transform of the cross-correlation function Rxy(r) to obtain a frequency function called 
the cross-power spectral density, 

$$S_{XY}(\omega) \triangleq \int_{-\infty}^{+\infty} R_{XY}(\tau)e^{-j\omega\tau}\,d\tau. \tag{9.5-8}$$

We will see later that the psd $S_{XX}(\omega)$ is real and everywhere nonnegative and in fact,
as the name implies, has the interpretation of a density function for average power versus
frequency. By contrast, the cross-power spectral density has no such interpretation and is
generally complex valued.


We next list some properties of the psd Sx x(w): 


1. $S_{XX}(\omega)$ is real valued since $R_{XX}(\tau)$ is conjugate symmetric.

2. If X(t) is a real-valued WSS process, then $S_{XX}(\omega)$ is an even function since $R_{XX}(\tau)$
is real and even. Otherwise $S_{XX}(\omega)$ may not be an even function of $\omega$.

3. $S_{XX}(\omega) \geq 0$ (to be shown in Theorem 9.5-1).


Additional properties of the psd are shown in Table 9.5-1. One could go on to expand
this table, but it will suit our purposes to stop at this point. One comment is in order: We
note the simplicity of these operations in the frequency domain. This suggests that for LSI
systems and stationary or WSS random processes, we should solve for output correlation
functions by first transforming the input correlation function into the frequency domain,
carrying out the indicated operations, and then transforming back to the correlation domain.
This is completely analogous to the situation in deterministic linear system theory for
shift-invariant systems.

Another comment would be that if the interpretation of Sx x(w) as a density of average 
power is correct, then the constant or mean component has all its average power concen- 
trated at w = 0 by the last entry in the table. Also by the next-to-last two entries in 
the table, modulation by the frequency wo shifts the distribution of average power up in 
frequency by wo. Both of these results should be quite intuitive. 


Example 9.5-2 
(power spectral density of white noise) The correlation function of a white noise process
W(t) with parameter $\sigma^2$ is given by $R_{WW}(\tau) = \sigma^2\delta(\tau)$. Hence the power spectral density
(psd), its Fourier transform, is just

$$S_{WW}(\omega) = \sigma^2, \quad -\infty < \omega < +\infty.$$

The psd is thus flat, and hence the name, white noise, by analogy to white light, which
contains equal power at every wavelength. Just like white light, white noise is an idealization
that cannot physically occur, since as we have seen earlier $R_{WW}(0) = \infty$, necessitating
infinite power. Again, we note that the parameter $\sigma^2$ must be interpreted as a power density
in the case of white noise.


An Interpretation of the psd 


Given a WSS process X(t), consider the finite support segment, 


$$X_T(t) \triangleq X(t)I_{[-T,+T]}(t),$$

where $I_{[-T,+T]}$ is an indicator function equal to 1 if $-T \leq t \leq +T$ and equal to 0 otherwise,
and $T > 0$. We can compute the Fourier transform of $X_T$ by the integral

$$FT\{X_T(t)\} = \int_{-T}^{+T} X(t)e^{-j\omega t}\,dt.$$

The magnitude squared of this random variable is

$$|FT\{X_T(t)\}|^2 = \int_{-T}^{+T}\int_{-T}^{+T} X(t_1)X^*(t_2)e^{-j\omega(t_1-t_2)}\,dt_1\,dt_2.$$

Dividing by 2T and taking the expectation, we get

$$\frac{1}{2T}E\left[|FT\{X_T(t)\}|^2\right] = \frac{1}{2T}\int_{-T}^{+T}\int_{-T}^{+T} R_{XX}(t_1-t_2)e^{-j\omega(t_1-t_2)}\,dt_1\,dt_2. \tag{9.5-9a}$$


To evaluate the double integral on the right, introduce the new coordinate system
$s = t_1 + t_2$, $\tau = t_1 - t_2$. The relationship between the $(s,\tau)$ and $(t_1,t_2)$ coordinate systems
is shown in Figure 9.5-1a. The Jacobian (scale-change) of this transformation is 1/2 and
the region of integration is the diamond-shaped surface $\wp$ shown in Figure 9.5-1b, which is
Figure 9.5-1a rotated counterclockwise 45° and whose sides have length $2T\sqrt{2}$. The double
integral in Equation 9.5-9a then becomes

$$\frac{1}{2T}\iint_{\wp} R_{XX}(\tau)e^{-j\omega\tau}\,\frac{1}{2}\,ds\,d\tau
= \frac{1}{4T}\int_{-2T}^{+2T} R_{XX}(\tau)e^{-j\omega\tau}\left[\int_{-(2T-|\tau|)}^{+(2T-|\tau|)} ds\right]d\tau$$

$$= \int_{-2T}^{+2T}\left[1 - \frac{|\tau|}{2T}\right]R_{XX}(\tau)e^{-j\omega\tau}\,d\tau.$$

Figure 9.5-1 (a) Square region in $(t_1,t_2)$ plane; (b) integration in diamond-shaped region created by
the transformation $s = t_1 + t_2$, $\tau = t_1 - t_2$.

In the limit as $T \to +\infty$, this integral tends to Equation 9.5-6 for an integrable $R_{XX}$;
thus

$$S_{XX}(\omega) = \lim_{T\to\infty}\frac{1}{2T}E\left[|FT\{X_T(t)\}|^2\right], \tag{9.5-9b}$$

so that $S_{XX}(\omega)$ is real and nonnegative and is related to average power at frequency $\omega$.
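The convergence behind Equation 9.5-9b can be observed numerically. After the change of variables, the normalized expectation equals the triangularly weighted integral $\int_{-2T}^{+2T}[1-|\tau|/(2T)]R_{XX}(\tau)e^{-j\omega\tau}d\tau$. The sketch below evaluates this integral by a midpoint rule for the assumed autocorrelation $R_{XX}(\tau) = \exp(-\alpha|\tau|)$ with $\alpha = 1$ (an illustrative choice) and compares it with the limiting value $2\alpha/(\alpha^2+\omega^2)$.

```python
import math


def windowed_psd(w, T, alpha=1.0, step=0.005):
    """Midpoint-rule value of the triangularly weighted integral over [-2T, 2T].

    The integrand is even in tau, so only the cosine part contributes and
    the integral is twice the integral over [0, 2T].
    """
    n = int(round(2.0 * T / step))
    h = 2.0 * T / n
    total = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        total += (1.0 - tau / (2.0 * T)) * math.exp(-alpha * tau) * math.cos(w * tau)
    return 2.0 * total * h


def limit_psd(w, alpha=1.0):
    """Limiting psd 2 alpha / (alpha^2 + w^2) for the exponential autocorrelation."""
    return 2.0 * alpha / (alpha ** 2 + w ** 2)


w = 0.5
errors = [abs(windowed_psd(w, T) - limit_psd(w)) for T in (5.0, 50.0, 500.0)]
print(errors)
```

For this example the error shrinks roughly in proportion to 1/T, which reflects the $|\tau|/(2T)$ term in the triangular weight.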
We next look at two examples of the computation of psd’s corresponding to correlation 
functions we have seen earlier. 


Example 9.5-3 
Find the power spectral density for the following exponential autocorrelation function with
parameter $\alpha > 0$:

$$R_{XX}(\tau) = \exp(-\alpha|\tau|), \quad -\infty < \tau < +\infty.$$

This is the autocorrelation function of the random telegraph signal (RTS) discussed in
Section 9.2. Its psd is computed as

$$S_{XX}(\omega) = \int_{-\infty}^{+\infty} R_{XX}(\tau)e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{+\infty} e^{-\alpha|\tau|}e^{-j\omega\tau}\,d\tau$$

$$= \int_{-\infty}^{0} e^{(\alpha-j\omega)\tau}\,d\tau + \int_{0}^{\infty} e^{-(\alpha+j\omega)\tau}\,d\tau$$

$$= 2\alpha/[\alpha^2+\omega^2], \quad -\infty < \omega < +\infty.$$

This function is plotted in Figure 9.5-2 for $\alpha = 3$. We see that the peak value is at the origin
and equal to $2/\alpha$. The "bandwidth" of the process is seen to be $\alpha$ on a 3 dB basis (if $S_{XX}$
is indeed a power density, to be shown). We note that while there is a cusp at the origin of
the correlation function $R_{XX}$, there is no cusp in its spectral density $S_{XX}$. In fact $S_{XX}$ is
continuous and differentiable everywhere. (It is true that $S_{XX}$ will always be continuous if
$R_{XX}$ is absolutely integrable.)

Figure 9.5-2 was created using MATLAB with the short m-file: 


clear
alpha = 3;
b = [1.0 0.0 alpha^2];    % denominator polynomial w^2 + alpha^2
w = linspace(-10,+10);    % frequency axis
den = polyval(b,w);
num = 2*alpha;
S = num./den;             % psd 2*alpha/(alpha^2 + w^2)
plot(w,S)


We note that the psd decays rather slowly, and thus the RTS process requires a significant
amount of bandwidth. The tails of the psd are so long because of the jumps
in the RTS sample functions.

Figure 9.5-2 Plot of psd for exponential autocorrelation function.

Example 9.5-4
(psd of triangular autocorrelation) Consider an autocorrelation function that is triangular
in shape such that the correlation goes to zero at shift $T > 0$,

$$R_{XX}(\tau) = \max\left[1 - \frac{|\tau|}{T}, 0\right].$$

One way this could arise is the asynchronous binary signaling (ABS) process introduced in
Section 9.2. This function is plotted as Figure 9.5-3. If we realize that this triangle can be
written as the convolution of two rectangular pulses, each of width T and height $1/\sqrt{T}$,
then we can use the convolution theorem of the Fourier transform [9-3,-4] to see that the
psd of the triangular correlation function is just the square of the Fourier transform of the
rectangular pulse, that is, the sinc function. The transform of the rectangular pulse is

$$\sqrt{T}\,\frac{\sin(\omega T/2)}{(\omega T/2)},$$

and the power spectral density $S_{XX}$ of the triangular correlation function is thus

$$S_{XX}(\omega) = T\left(\frac{\sin(\omega T/2)}{\omega T/2}\right)^2. \tag{9.5-10}$$

As a check we note that $S_{XX}(0)$ is just the area under the correlation function, which in the
triangular case is easily seen to be T. Thus checking,

$$S_{XX}(0) = \int_{-\infty}^{+\infty} R_{XX}(\tau)\,d\tau = 2\cdot\frac{1}{2}\cdot 1\cdot T = T.$$
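Equation 9.5-10 can also be checked by direct numerical integration of the triangular autocorrelation. The sketch below uses a midpoint rule with an illustrative value T = 2 (an assumption made here); the function names are hypothetical.

```python
import math


def triangle_psd_numeric(w, T, n=20000):
    """Midpoint-rule Fourier transform of max(1 - |tau|/T, 0).

    The triangle is even, so the transform is twice the cosine integral
    over [0, T].
    """
    h = T / n
    total = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        total += (1.0 - tau / T) * math.cos(w * tau)
    return 2.0 * total * h


def triangle_psd_formula(w, T):
    """Closed form T * (sin(wT/2) / (wT/2))^2, with the limit T at w = 0."""
    x = 0.5 * w * T
    return T if x == 0.0 else T * (math.sin(x) / x) ** 2


T = 2.0
for w in (0.0, 1.0, 3.0, 7.5):
    print(w, triangle_psd_numeric(w, T), triangle_psd_formula(w, T))
```

The value at $\omega = 0$ reproduces the area check, and the psd is nonnegative at every frequency, as a squared sinc must be.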


Another way the triangular correlation function can arise is the running integral average
operating on white noise. Consider

$$X(t) \triangleq \frac{1}{\sqrt{T}}\int_{t-T}^{t} W(s)\,ds,$$

with W(t) a white noise with zero mean and correlation function $R_{WW}(\tau) = \delta(\tau)$. Then
$\mu_X(t) = 0$ and $E[X(t_1)X(t_2)]$ can be computed as

$$R_{XX}(t_1,t_2) = \frac{1}{T}\int_{t_1-T}^{t_1}\int_{t_2-T}^{t_2} R_{WW}(s_1-s_2)\,ds_2\,ds_1
= \frac{1}{T}\int_{t_1-T}^{t_1}\left[\int_{t_2-T}^{t_2}\delta(s_2-s_1)\,ds_2\right]ds_1.$$

Now defining the inner integral as

$$g_{t_2}(s_1) \triangleq \int_{t_2-T}^{t_2}\delta(s_2-s_1)\,ds_2 = \begin{cases} 1, & t_2-T \leq s_1 \leq t_2, \\ 0, & \text{else,} \end{cases}$$

which as a function of $s_1$ looks as shown in Figure 9.5-4, so

$$R_{XX}(t_1,t_2) = \frac{1}{T}\int_{t_1-T}^{t_1} g_{t_2}(s_1)\,ds_1 = \max\left[1 - \frac{|t_1-t_2|}{T}, 0\right].$$

Figure 9.5-3 A triangular autocorrelation function.

Figure 9.5-4 Plot of equation versus $s_1$ for $t_2 > T$.

More on White Noise

The correlation function of white noise is an impulse (Equation 9.3-6), so its psd is a constant

$$S_{WW}(\omega) = \sigma^2, \quad -\infty < \omega < +\infty.$$

The name white noise thus arises out of the fact that the power spectral density is constant
at all frequencies just as in white light, which contains all wavelengths in equal amounts.†

†A mathematical idealization! Physics tells us that, for realistic models, the power density must tend
toward zero as $\omega \to \infty$.

Here we look at the white noise process as a limit approached by a sequence of second-order
processes. To this end consider an independent increment process (cf. Definition 9.2-1) with
zero mean such as the Wiener process ($R_{XX}(t_1,t_2) = \sigma^2\min(t_1,t_2)$) or a centered Poisson
process, that is, $N_c(t) = N(t) - \lambda t$, with correlation $R_{N_cN_c}(t_1,t_2) = \lambda\min(t_1,t_2)$. Actually
we need only uncorrelated increments here; thus we require X(t) only to have uncorrelated
increments. For such processes we have by Equation 9.2-17,


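$$E\left[(X(t+\Delta) - X(t))^2\right] = \alpha\Delta,$$

where $\alpha$ is the variance parameter.
Thus upon letting $X_\Delta(t)$ denote the first-order difference divided by $\Delta$,

$$X_\Delta(t) \triangleq [X(t+\Delta) - X(t)]/\Delta,$$

we have

$$E[X_\Delta^2(t)] = \alpha/\Delta$$

and

$$E[X_\Delta(t_1)X_\Delta(t_2)] = 0 \quad \text{for } |t_2 - t_1| > \Delta.$$

If we consider $|t_2 - t_1| < \Delta$, we can do the following calculation, which shows that the
resulting correlation function is triangular, just as in Example 9.5-4. Since $X(t_1+\Delta) - X(t_1)$
is distributed as $N(0, \alpha\Delta)$, taking $t_1 < t_2$ and shifting $t_1$ to 0, and $t_2$ to $t_2 - t_1$, the expectation
becomes

$$\frac{1}{\Delta^2}E\left[X(\Delta)\left(X(t_2-t_1+\Delta) - X(t_2-t_1)\right)\right]$$

$$= \frac{1}{\Delta^2}E\left[X(\Delta)\left(X(\Delta) - X(t_2-t_1)\right)\right] \quad \text{since } (\Delta, t_2-t_1+\Delta] \cap (0,\Delta] = \emptyset,$$

$$= \frac{1}{\Delta^2}\left[\alpha\Delta - \alpha(t_2-t_1)\right] = \frac{\alpha}{\Delta}\left(1 - (t_2-t_1)/\Delta\right).$$

Thus, the process generated by the first-order difference is WSS (the mean is zero) and has
correlation function $R_{\Delta\Delta}(\tau)$ given as

$$R_{\Delta\Delta}(\tau) = \frac{\alpha}{\Delta}\max\left[1 - \frac{|\tau|}{\Delta}, 0\right].$$

We note from Figure 9.5-5 that as $\Delta$ goes to zero this correlation function tends to a delta
function.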

Since we just computed the Fourier transform of a triangular function in Example 9.5-4, 
we can write the psd by inspection as 


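$$S_{\Delta\Delta}(\omega) = \alpha\left(\frac{\sin(\omega\Delta/2)}{\omega\Delta/2}\right)^2.$$

This psd is approximately flat out to $|\omega| = \pi/(3\Delta)$. As $\Delta \to 0$, $S_{\Delta\Delta}(\omega)$ approaches the
constant $\alpha$ everywhere. Thus as $\Delta \to 0$, $X_\Delta(t)$ "converges" to white noise, the formal
derivative of an uncorrelated increments process,

$$R_{X'X'}(t_1,t_2) = \frac{\partial^2}{\partial t_1\,\partial t_2}\left[\sigma^2\min(t_1,t_2)\right]
= \frac{\partial}{\partial t_1}\left[\sigma^2u(t_1-t_2)\right]
= \sigma^2\delta(t_1-t_2).$$

Figure 9.5-5 Correlation function of $X_\Delta(t)$.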
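To put numbers on "approximately flat," the sketch below evaluates the squared-sinc psd $\alpha(\sin(\omega\Delta/2)/(\omega\Delta/2))^2$ at the band edge $|\omega| = \pi/(3\Delta)$ and at a fixed frequency for shrinking $\Delta$. The values $\alpha = 1$, $\Delta = 0.01$, and $\omega = 10$ are illustrative assumptions.

```python
import math


def psd_delta(w, delta, alpha=1.0):
    """Squared-sinc psd alpha * (sin(w delta/2) / (w delta/2))^2."""
    x = 0.5 * w * delta
    return alpha if x == 0.0 else alpha * (math.sin(x) / x) ** 2


delta = 0.01
# Level at |w| = pi/(3 delta), relative to the value alpha = 1 at w = 0.
edge_ratio = psd_delta(math.pi / (3.0 * delta), delta)

# Fixed frequency w = 10 as delta shrinks: the psd approaches alpha.
at_w10 = [psd_delta(10.0, d) for d in (1.0, 0.1, 0.01)]

print(edge_ratio, 10.0 * math.log10(edge_ratio))
print(at_w10)
```

At the band edge the level is $(3/\pi)^2$ of the peak, a drop of less than half a decibel, and at any fixed frequency the psd climbs toward $\alpha$ as $\Delta$ decreases.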


If one has a system that is continuous in its response to stimuli, then we say that the system 
is continuous; that is, the system operator is a continuous operator. This would mean, for 
example, that the output would change only slightly if the input changed slightly. A stable 
differential or difference equation is an example of such a continuous operator. We will see 
that for linear shift-invariant systems that are described by system functions, the response 
to the random process X(t) will change only slightly when A changes, if A is small and if 
the systems are lowpass in the sense that the system function tends to zero as |w| — +00. 
Thus the white noise can be seen as a convenient artifice for more easily constructing this 
limiting output. (See Problem 9.36.) 

If we take Fourier transforms of both sides of Equation 9.5-3 we obtain the cross-power 
spectral density, 


$$S_{YX}(\omega) = H(\omega)S_{XX}(\omega). \tag{9.5-11}$$

Since $S_{YX}$ is a frequency-domain representation of the cross-correlation function $R_{YX}$,
Equation 9.5-11 tells us that Y(t) and X(t) will have high cross correlation at those frequencies
$\omega$ where the product of $H(\omega)$ and $S_{XX}(\omega)$ is large. Similarly, from Equation 9.5-5, we
can obtain

$$S_{XY}(\omega) = H^*(\omega)S_{XX}(\omega). \tag{9.5-12}$$

From the fundamental Equation 9.5-4, repeated here for convenience,

$$R_{YY}(\tau) = h(\tau) * R_{XX}(\tau) * h^*(-\tau), \tag{9.5-13}$$

we get, upon Fourier transformation, in the spectral density domain,

$$S_{YY}(\omega) = |H(\omega)|^2S_{XX}(\omega) = G(\omega)S_{XX}(\omega). \tag{9.5-14}$$

These two equations are among the most important in the theory of stationary random
processes. In particular, Equation 9.5-14 shows that the average power in the output process
at a given frequency is simply the average input power at that frequency multiplied by $|H(\omega)|^2$, the
power gain of the LSI system. We can call $G(\omega) = |H(\omega)|^2$ the psd transfer function.


Example 9.5-5 
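(average power) The transfer function of an LSI system is given by

$$H(\omega) = \operatorname{sgn}(\omega)\left(\frac{\omega}{2\pi}\right)^2\exp[-j(\omega+2)]\,W(\omega),$$

where $\operatorname{sgn}(\cdot)$ is the algebraic sign function, and where the frequency window function

$$W(\omega) \triangleq \begin{cases} 1, & \text{for } |\omega| < 40\pi, \\ 0, & \text{else.} \end{cases}$$

Let the WSS input random process have autocorrelation function,

$$R_{XX}(\tau) = \frac{5}{2}\delta(\tau) + 2.$$

Compute the average measurable power in the band 0.0 to 1.0 Hertz (single-sided). In
radians per second, this is the double-sided range $-2\pi$ to $2\pi$. First we Fourier transform $R_{XX}(\tau)$ to
obtain $S_{XX}(\omega) = \frac{5}{2} + 4\pi\delta(\omega)$. Next we compute the psd transfer function
$G(\omega) = |H(\omega)|^2 = \left(\frac{\omega}{2\pi}\right)^4W(\omega)$. The output psd then is

$$S_{YY}(\omega) = \frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4W(\omega),$$

and the total average output power would be calculated as

$$R_{YY}(0) = \frac{1}{2\pi}\int_{-40\pi}^{+40\pi}\frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4d\omega = 3.2\times10^6 \text{ watts},$$

while the power in the band $[-2\pi, +2\pi]$ is

$$P = \frac{1}{2\pi}\int_{-2\pi}^{+2\pi}\frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4d\omega = 1 \text{ watt}.$$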
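The band-power integrals in Example 9.5-5 are easy to confirm by numerical quadrature. The sketch below assumes the output psd has the form $S_{YY}(\omega) = \frac{5}{2}(\omega/2\pi)^4$ inside the window $|\omega| < 40\pi$ and zero outside (the impulse at $\omega = 0$ in the input psd is removed because the power gain vanishes there); the function names are hypothetical.

```python
import math


def output_psd(w):
    """Assumed output psd: (5/2) (w / 2 pi)^4 for |w| < 40 pi, else 0."""
    if abs(w) >= 40.0 * math.pi:
        return 0.0
    return 2.5 * (w / (2.0 * math.pi)) ** 4


def band_power(w_lo, w_hi, n=200000):
    """Average power (1 / 2 pi) * integral of the psd over [w_lo, w_hi]."""
    h = (w_hi - w_lo) / n
    total = sum(output_psd(w_lo + (i + 0.5) * h) for i in range(n))
    return total * h / (2.0 * math.pi)


p_band = band_power(-2.0 * math.pi, 2.0 * math.pi)      # 0 to 1 Hz, double-sided
p_total = band_power(-40.0 * math.pi, 40.0 * math.pi)   # full passband

print(p_band, p_total)
```

With the substitution $u = \omega/2\pi$ the in-band integral reduces to $\frac{5}{2}\int_{-1}^{1}u^4du = 1$, and the full-band integral to $\frac{5}{2}\int_{-20}^{20}u^4du = 20^5$, which the quadrature reproduces.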


The following comment on Equations 9.5-3 through 9.5-14 may help you keep track
of the conjugates and minus signs. Notice that the conjugate and negative argument on
the impulse response, which becomes simply a conjugate in the frequency domain, arises
in connection with the second factor in the correlation. The $h(\tau)$ without the conjugate or
negative time argument comes from the linear operation implied by the first subscript, that
is, the first factor in the correlation.

With reference to Equation 9.5-11 we see that the cross-spectral density function can 
be complex and hence has no positivity or conjugate symmetry properties, since those 
that Sxx has will be lost upon multiplication with an arbitrary, generally complex H. 
On the other hand, as shown in Equation 9.5-14, the psd of the output will share the 
real and nonnegative aspects of the psd of the input, since multiplication with |H|? will 
not change these properties. Table 9.5-2 sets forth all the above relations for easy 
reference. 

We are now in a position to show that the psd $S(\omega)$ has a precise interpretation as a
density for average power versus frequency. We will show directly that $S(\omega) \geq 0$ for all $\omega$
and that the average power in the frequency band $(\omega_1,\omega_2)$ is given by the integral of $S(\omega)$
over that frequency band.


Theorem 9.5-1 Let X(t) be a stationary, second-order random process with correlation
function $R_{XX}(\tau)$ and power spectral density $S_{XX}(\omega)$. Then $S_{XX}(\omega) \geq 0$ and,
for all $\omega_2 \geq \omega_1$,

$$\frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S_{XX}(\omega)\,d\omega$$

is the average power in the frequency band $(\omega_1,\omega_2)$.

Proof Let $\omega_2 > \omega_1$ both be real numbers. Define a filter transfer function as follows:

$$H(\omega) \triangleq \begin{cases} 1, & \omega \in (\omega_1,\omega_2), \\ 0, & \text{else,} \end{cases}$$

and note that it passes signals only in the band $(\omega_1,\omega_2)$. If X(t) is input to this filter, the
psd of the output Y(t) is (by Equation 9.5-14)

$$S_{YY}(\omega) = \begin{cases} S_{XX}(\omega), & \omega \in (\omega_1,\omega_2), \\ 0, & \text{else.} \end{cases}$$

Now the output power in Y(t) has average value $E[|Y(t)|^2] = R_{YY}(0)$,

$$R_{YY}(0) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S_{YY}(\omega)\,d\omega = \frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S_{XX}(\omega)\,d\omega \geq 0,$$

and this holds for all $\omega_2 > \omega_1$. So by choosing $\omega_2 \approx \omega_1$ we can conclude that $S_{XX}(\omega) \geq 0$
for all $\omega$ and that the function $S_{XX}$ thus has the interpretation of a power density in the
sense that if we integrate this function across a frequency band, we get the average power
in that band.


Table 9.5-2 Input/Output Relations for Linear Systems with WSS Inputs

Correlation domain | Frequency domain
--- | ---
WSS Random Process: $Y(t) = h(t) * X(t)$ | Output Mean: $\mu_Y = H(0)\mu_X$
Crosscorrelations: | Cross-Power Spectral Densities:
$R_{XY}(\tau) = R_{XX}(\tau) * h^*(-\tau)$ | $S_{XY}(\omega) = S_{XX}(\omega)H^*(\omega)$
$R_{YX}(\tau) = h(\tau) * R_{XX}(\tau)$ | $S_{YX}(\omega) = H(\omega)S_{XX}(\omega)$
$R_{YY}(\tau) = R_{YX}(\tau) * h^*(-\tau)$ | $S_{YY}(\omega) = S_{YX}(\omega)H^*(\omega)$
Autocorrelation: | Power Spectral Density:
$R_{YY}(\tau) = h(\tau) * R_{XX}(\tau) * h^*(-\tau)$ | $S_{YY}(\omega) = |H(\omega)|^2S_{XX}(\omega)$
$= g(\tau) * R_{XX}(\tau)$ | $= G(\omega)S_{XX}(\omega)$

Output Power and Variance:

$$E\{|Y(t)|^2\} = R_{YY}(0) = \frac{1}{2\pi}\int_{-\infty}^{+\infty}|H(\omega)|^2S_{XX}(\omega)\,d\omega$$

$$\sigma_Y^2 = R_{YY}(0) - |\mu_Y|^2$$


We saw earlier that the conditions that a function must meet to be a valid correlation 
or covariance function are rather strong. In fact, we have seen that the function must be 
positive semidefinite, although we have not in fact shown that this condition is sufficient. 
It turns out that one more advantage of working in the frequency domain is the ease with 
which we can specify when a given frequency function qualifies as a power spectral density. 
The function simply must be real and nonnegative, that is, $S(\omega) \geq 0$. We can see this for
a given function $F(\omega) \geq 0$ by taking a filter with transfer function $H(\omega) = \sqrt{F(\omega)}$ and
letting the input be white noise with $S_{WW} = 1$. Then by Equation 9.5-14 the output psd is
$S_{XX}(\omega) = F(\omega)$, thus showing that F is a valid psd. If the random process is real valued,
as it most often is, then we also need $F(\omega)$ to be an even function to satisfy psd property
(2) listed just after Definition 9.5-1. All this can be formalized as follows.


Theorem 9.5-2 Let $F(\omega)$ be an integrable function that is real and nonnegative;
that is, $F(\omega) \geq 0$ for all $\omega$. Then there exists a stationary random process with power
spectral density $S(\omega) = F(\omega)$. If the random process is to be real valued, then $F(\omega)$ must
be an even function of $\omega$.


We now see that the test for a valid spectral density function is much easier than the
condition of positive semidefiniteness for the correlation function. In fact, it is relatively
easy to show that the positive semidefinite condition on a function is equivalent to the
nonnegativity of its Fourier transform, and hence that positive semidefiniteness is the sufficient
condition for a function to be a valid correlation or covariance function. First, by
Theorem 9.5-2 we know that the positive semidefinite condition is implied by the nonnegativity
of $S(\omega)$. To show equivalence, it remains to show that the positive semidefinite
condition on a function $f(\tau)$ implies that its Fourier transform $F(\omega)$ is nonnegative. We
proceed as follows: Since $f(\tau)$ is positive semidefinite we have,

$$\sum_{n=1}^{N}\sum_{m=1}^{N} a_na_m^*f(\tau_n-\tau_m) \geq 0.$$

Also since

$$f(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)e^{+j\omega\tau}\,d\omega,$$

we have

$$\frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)\left[\sum_{n=1}^{N}\sum_{m=1}^{N} a_na_m^*e^{+j\omega(\tau_n-\tau_m)}\right]d\omega \geq 0,$$

which can be rewritten as

$$\frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)\left|\sum_{n=1}^{N} a_ne^{+j\omega\tau_n}\right|^2d\omega \geq 0,$$

where we recognize the term inside the magnitude square sign as a so-called transversal
or tapped delay-line filter. Thus by choosing N large enough, with the $\tau_n$ equally spaced,
we can select the $a_n$'s to approximate any ideal filter transfer function $H(\omega)$ arbitrarily closely.
Then by choosing H to be very narrow bandpass filters centered at each value of $\omega$, we can
eventually conclude that $F(\omega) \geq 0$ for all $\omega$, $-\infty < \omega < +\infty$. We have thereby established
the following theorem.


Theorem 9.5-3 A necessary and sufficient condition for $f(\tau)$ to be a correlation
function is that it be positive semidefinite.
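In practice the frequency-domain form of this test is the convenient one: transform the candidate function and look at the sign of the result. The sketch below applies a midpoint-rule cosine transform to two real, even functions chosen here for illustration, $\exp(-|\tau|)$ and a rectangle equal to 1 for $|\tau| < 1.5$. The function names and the truncation limit for the exponential are assumptions of this sketch.

```python
import math


def even_transform(f, w, upper, n=40000):
    """Fourier transform of a real, even f, truncated to |tau| <= upper.

    For even f the transform is 2 * integral_0^upper f(tau) cos(w tau) dtau,
    evaluated here by the midpoint rule.
    """
    h = upper / n
    return 2.0 * h * sum(f((i + 0.5) * h) * math.cos(w * (i + 0.5) * h)
                         for i in range(n))


f_exp = lambda tau: math.exp(-abs(tau))              # candidate 1
f_box = lambda tau: 1.0 if abs(tau) < 1.5 else 0.0   # candidate 2

# The exponential is negligible beyond tau = 40, so truncate there.
F_exp = [even_transform(f_exp, w, 40.0) for w in (0.0, 1.0, 3.0, 10.0)]
F_box_at_3 = even_transform(f_box, 3.0, 1.5)

print(F_exp)        # all positive, close to 2 / (1 + w^2)
print(F_box_at_3)   # negative, close to 2 sin(4.5) / 3
```

The rectangle's transform is negative at $\omega = 3$, so by the theorem it cannot be a correlation function, whereas the exponential passes at every frequency tested (consistent with its closed-form psd).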


Incidentally, there is an analogy here for probability density functions, which can be 
regarded as the Fourier transforms of their CFs. As we know, nonnegativity is the sufficient 
condition for a function to be a valid pdf (assuming that it is normalized to integrate to 
one); thus the probability density is analogous to the power spectral density; and in fact one 
can define a spectral distribution function [9-7] analogous to the cumulative distribution 
function. Thus the CF and the correlation function are also analogous and so both must be 
positive semidefinite to be valid for their respective roles. Also for the CF the normalization 
of the probability density to integrate to one imposes the condition ®(0) = 1, which is easily 
met by scaling an arbitrary positive semidefinite function that is not identically zero. 


Stationary Processes and Differential Equations 


We shall now examine stochastic differential equations with a stationary or at least WSS 
input, and also with the linear constant-coefficient differential equation (LCCDE) valid for 
all time. We assume that the equation is stable in the bounded-input, bounded-output 
(BIBO) sense, so that the resulting output process is also stationary (or WSS if that is the 
condition on the input process). 

Thus consider the following general LCCDE: 


$$a_NY^{(N)}(t) + a_{N-1}Y^{(N-1)}(t) + \cdots + a_0Y(t)
= b_MX^{(M)}(t) + b_{M-1}X^{(M-1)}(t) + \cdots + b_0X(t), \quad -\infty < t < +\infty.$$

This represents the relationship between output Y(t) and input X(t) in a linear system
with frequency response

$$H(\omega) = B(\omega)/A(\omega), \quad \text{with } a_0 \neq 0,$$

where

$$B(\omega) \triangleq \sum_{m=0}^{M} b_m(j\omega)^m$$

and

$$A(\omega) \triangleq \sum_{n=0}^{N} a_n(j\omega)^n,$$

which is a rational function with numerator polynomial $B(\omega)$ and denominator polynomial
$A(\omega)$. Because the system is stable, we can apply the results of the previous section to
obtain

$$\mu_Y = \mu_XH(0),$$

$$S_{YX}(\omega) = H(\omega)S_{XX}(\omega),$$

and

$$S_{YY}(\omega) = |H(\omega)|^2S_{XX}(\omega),$$

where

$$H(0) = b_0/a_0 \quad \text{and} \quad |H(\omega)|^2 = |B(\omega)|^2/|A(\omega)|^2.$$

So

$$\mu_Y = (b_0/a_0)\mu_X \quad \text{and} \quad S_{YY}(\omega) = \left(|B(\omega)|^2/|A(\omega)|^2\right)S_{XX}(\omega).$$
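The frequency-domain recipe for an LCCDE translates into a few lines of code: evaluate the two polynomials in $j\omega$, form their ratio, and scale the input psd by the squared magnitude. The sketch below is one possible arrangement; the function names and the coefficient ordering (index k multiplies the k-th derivative) are conventions chosen here, and the first-order system with white input is used only as a test case.

```python
def poly_jw(coeffs, w):
    """Evaluate sum_k coeffs[k] * (j w)^k; coeffs[k] multiplies the k-th derivative."""
    return sum(c * (1j * w) ** k for k, c in enumerate(coeffs))


def freq_response(a, b, w):
    """H(w) = B(w) / A(w) for output coefficients a and input coefficients b."""
    return poly_jw(b, w) / poly_jw(a, w)


def output_psd(a, b, s_xx, w):
    """S_YY(w) = |H(w)|^2 * S_XX(w)."""
    return abs(freq_response(a, b, w)) ** 2 * s_xx(w)


# Test case: Y'(t) + alpha Y(t) = X(t) with white input, S_XX(w) = 1.
alpha = 2.0
a = [alpha, 1.0]   # a_0 = alpha, a_1 = 1
b = [1.0]          # b_0 = 1

dc_gain = freq_response(a, b, 0.0)                   # b_0 / a_0
s_yy_at_3 = output_psd(a, b, lambda w: 1.0, 3.0)     # 1 / (alpha^2 + 9)
print(dc_gain, s_yy_at_3)
```

The dc gain reproduces $b_0/a_0$ and the output psd reproduces $1/(\alpha^2+\omega^2)$ for this test system, so the same functions can be reused for higher-order equations by extending the coefficient lists.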


This frequency-domain analysis method is generally preferable to the time-domain 
approach but is restricted to the case where both the input and output processes are at least 
WSS. After we obtain the various spectral densities, we can use the IFT to obtain the
correlation and covariance functions if they are desired. The calculation of the required IFTs
is often easier if viewed as an inverse two-sided Laplace transform. The Laplace transform
of Equation 9.5-3 is

$$S_{YX}(s) = H(s)S_{XX}(s), \tag{9.5-15}$$

while the Laplace transform of Equation 9.5-13 is written

$$S_{YY}(s) = H(s)H(-s)S_{XX}(s) \tag{9.5-16}$$

in light of $h^*(-\tau) \leftrightarrow H(-s)$. Recalling the definition of the two-sided Laplace transform
[9-3], for any $f(\tau)$,

$$F(s) \triangleq \int_{-\infty}^{+\infty} f(\tau)e^{-s\tau}\,d\tau,$$

we note that such a function of the complex variable s may be obtained from the Fourier
transform $F(\omega)$, a function of the real variable $\omega$, by a two-step procedure. First set

$$F(s)\big|_{s=j\omega} = F(\omega)$$

and then replace $j\omega$ by s. An analogous extension method was used earlier for the discrete-time
case in Chapter 8 where the Fourier transform was extended to the entire complex
plane by the Z-transform.


Example 9.5-6 
(output correlation—first-order system) Consider the first-order differential equation 


$$Y'(t) + \alpha Y(t) = X(t), \qquad \alpha > 0,$$
with stationary input $X(t)$ with mean $\mu_X = 0$ and impulse covariance function
$K_{XX}(\tau) = \delta(\tau)$. The system function is easily seen to be
$$H(\omega) = \frac{1}{\alpha + j\omega}$$
and the psd of the input process is
$$S_{XX}(\omega) = 1,$$
so we have the following cross- and output-power spectral densities:
$$S_{YX}(\omega) = H(\omega) S_{XX}(\omega) = \frac{1}{\alpha + j\omega},$$
$$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega) = \frac{1}{|\alpha + j\omega|^2} = \frac{1}{\alpha^2 + \omega^2}.$$
We now convert to Laplace transforms, with $s = j\omega$,
$$S_{YY}(j\omega) = \frac{1}{\alpha^2 - (j\omega)^2} = \frac{1}{(\alpha + j\omega)(\alpha - j\omega)},$$
so that
$$S_{YY}(s) = \frac{1}{(\alpha + s)(\alpha - s)}.$$
Using the residue method (cf. Appendix A) or partial fraction expansion, one can then
directly obtain the following output correlation function by inverse Laplace transform:
$$R_{YY}(\tau) = \frac{1}{2\alpha} \exp(-\alpha|\tau|), \qquad -\infty < \tau < +\infty,$$
which is also the output covariance function since $\mu_Y = 0$. By the above equation for
$S_{YX}(\omega)$ we also obtain the cross-correlation function $R_{YX}(\tau) = \exp(-\alpha\tau)u(\tau)$.
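
As a numerical sanity check on this example, the following Python sketch evaluates the inverse Fourier transform of $S_{YY}(\omega) = 1/(\alpha^2 + \omega^2)$ with a truncated trapezoidal sum and compares it against $\exp(-\alpha|\tau|)/(2\alpha)$. The function names, the truncation frequency `w_max`, and the grid size are illustrative assumptions, not part of the text; the truncation alone limits the agreement to roughly three or four decimal places.

```python
import numpy as np

def ryy_numeric(tau, alpha, w_max=2000.0, n=400_001):
    """Approximate (1/2pi) * integral of S_YY(w) exp(j w tau) dw over [-w_max, w_max]."""
    w = np.linspace(-w_max, w_max, n)
    dw = w[1] - w[0]
    # S_YY is real and even, so the inverse transform reduces to a cosine integral.
    integrand = np.cos(w * tau) / (alpha**2 + w**2)
    # Trapezoidal rule written out explicitly.
    total = dw * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return total / (2.0 * np.pi)

def ryy_closed_form(tau, alpha):
    """Closed-form output correlation of the first-order system driven by white noise."""
    return np.exp(-alpha * abs(tau)) / (2.0 * alpha)

if __name__ == "__main__":
    for tau in (0.0, 0.5, 2.0):
        print(tau, ryy_numeric(tau, 1.0), ryy_closed_form(tau, 1.0))
```

The slowly decaying $1/\omega^2$ tail of the psd is what makes the truncated integral converge slowly near $\tau = 0$; the residue method avoids this issue entirely.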


In Example 9.5-6 it is interesting that $R_{YX}(\tau)$ is 0 for $\tau < 0$. This means that the
output Y is orthogonal to all future values of the input X, a white noise in this case. This
occurs for two reasons: the system is causal and the input is a white noise process.
The system causality requires that the output not depend directly on (i.e., not be a function 
of) future inputs but only depend directly on present and past inputs. The whiteness of the 
input X guarantees that the past and present inputs will be uncorrelated with future inputs. 
Combining both conditions we see that there will be no cross-correlation between the present 
output and the future inputs. If we assume additionally that the input is Gaussian, then the 
input process is an independent process and the output becomes independent of all future 
inputs. Then we can say that the causality of the system prevents the direct dependence 
of the present output on future inputs, and the independent process input prevents any 
indirect dependence. These concepts are important to the theory of Markov processes as 
used in estimation theory (cf. Chapter 11). 


Example 9.5-7 
(output correlation function, second-order system) Consider the following second-order
LCCDE:
$$\frac{d^2 Y(t)}{dt^2} + 3\frac{dY(t)}{dt} + 2Y(t) = 5X(t),$$
again with white noise input as in the previous example. Here the system function is
$$H(\omega) = \frac{5}{(j\omega)^2 + 3j\omega + 2} = \frac{5}{(2 - \omega^2) + j3\omega}.$$
Thus analogously to Example 9.5-6 the output psd becomes
$$S_{YY}(\omega) = \frac{25}{(2 - \omega^2)^2 + (3\omega)^2} = \frac{25}{\omega^4 + 5\omega^2 + 4}.$$
Applying the residue method to evaluate the IFT, we define the function of a complex
variable $S_{YY}(s)\big|_{s=j\omega} \triangleq S_{YY}(\omega)$ and rewrite the right-hand side in terms of
the complex variable $j\omega$ to obtain
$$S_{YY}(j\omega) = \frac{25}{(j\omega)^4 - 5(j\omega)^2 + 4}.$$
Substituting $s = j\omega$, we get
$$S_{YY}(s) = \frac{25}{s^4 - 5s^2 + 4},$$
which factors as
$$S_{YY}(s) = \frac{5}{(s+2)(s+1)} \cdot \frac{5}{(-s+2)(-s+1)} = H(s)H(-s),$$
where $H(s)$ is the Laplace transform system function. Then the inverse Laplace transform
yields the output correlation function
$$R_{YY}(\tau) = 25\left[\frac{1}{6}\exp(-|\tau|) - \frac{1}{12}\exp(-2|\tau|)\right], \qquad -\infty < \tau < +\infty.$$


We leave the details of the calculation to the interested reader. 
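
For readers who want to check the calculation, here is a minimal Python sketch of the residue method for a rational psd with simple poles. For $\tau \ge 0$ it sums the residues of $S_{YY}(s)e^{s\tau}$ at the left-half-plane poles and uses the evenness of $R_{YY}$ for $\tau < 0$. The function name and the coefficient-list interface are assumptions made for illustration.

```python
import numpy as np

def ryy_by_residues(tau, num=(25.0,), den=(1.0, 0.0, -5.0, 0.0, 4.0)):
    """R_YY(tau) for S_YY(s) = num(s)/den(s), assuming simple poles and a real process.

    Default coefficients describe S_YY(s) = 25 / (s^4 - 5 s^2 + 4).
    """
    poles = np.roots(den)
    lhp = poles[poles.real < 0]               # poles enclosed when closing to the left
    dden = np.polyder(den)                    # derivative of the denominator polynomial
    # Residue of num/den at a simple pole p is num(p) / den'(p).
    residues = np.polyval(num, lhp) / np.polyval(dden, lhp)
    t = abs(tau)                              # R_YY is even in tau for a real process
    return float(np.real(np.sum(residues * np.exp(lhp * t))))

def ryy_closed_form(tau):
    """Closed-form result stated in Example 9.5-7."""
    t = abs(tau)
    return 25.0 * (np.exp(-t) / 6.0 - np.exp(-2.0 * t) / 12.0)

if __name__ == "__main__":
    for tau in (0.0, 0.7, -1.3):
        print(tau, ryy_by_residues(tau), ryy_closed_form(tau))
```

The residues at $s = -1$ and $s = -2$ come out as $25/6$ and $-25/12$, which are the two coefficients in the closed-form expression.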


9.6 PERIODIC AND CYCLOSTATIONARY PROCESSES 


Besides stationarity and its wide-sense version, two other classes of random processes are 
often encountered. They are periodic and cyclostationary processes and are here defined. 


Definition 9.6-1 A random process $X(t)$ is wide-sense periodic if there is a $T > 0$
such that
$$\mu_X(t) = \mu_X(t + T) \quad \text{for all } t$$
and
$$K_{XX}(t_1, t_2) = K_{XX}(t_1 + T, t_2) = K_{XX}(t_1, t_2 + T) \quad \text{for all } t_1, t_2.$$
The smallest such $T$ is called the period. Note that $K_{XX}(t_1, t_2)$ is then periodic with period
$T$ along both axes.


An example of a wide-sense periodic random process is the random complex exponential
of Example 9.4-1. In fact, the random Fourier series representation of the process
$$X(t) = \sum_{k=1}^{\infty} A_k \exp\left(\frac{j2\pi k t}{T}\right) \tag{9.6-1}$$
with random variable coefficients $A_k$ would also be wide-sense periodic. A wide-sense
periodic process can also be WSS, in which case we call it wide-sense periodic stationary. We will
consider these processes further in Chapter 10, where we also refer to them as mean-square
periodic. The covariance function of a wide-sense periodic process is generically sketched in
Figure 9.6-1. We see that $K_{XX}(t_1, t_2)$ is doubly periodic with a two-dimensional period of
$(T, T)$. In Chapter 10 we will see that the sample functions of a wide-sense periodic random
process are periodic with probability 1, that is,
$$X(t) = X(t + T) \quad \text{for all } t,$$
except for a set of outcomes, i.e., an event, of probability zero.

Another important classification is cyclostationarity. It is only partially related to peri- 
odicity and is often confused with it. The reader should carefully note the difference in the 
following definition. Roughly speaking, cyclostationary processes have statistics that are 
periodic, while periodic processes have sample functions that are periodic. 


Definition 9.6-2 A random process $X(t)$ is wide-sense cyclostationary if there exists
a positive value $T$ such that
$$\mu_X(t) = \mu_X(t + T) \quad \text{for all } t$$
and
$$K_{XX}(t_1, t_2) = K_{XX}(t_1 + T, t_2 + T) \quad \text{for all } t_1 \text{ and } t_2.$$

Figure 9.6-1 Possible contours of the covariance function of a wide-sense (WS) periodic random
process.


An example of cyclostationarity is the random PSK process of Equation 9.2-11. Its 
mean function is zero and hence trivially periodic. Its covariance function (Equation 9.2-13) 
is invariant to a shift by $T$ in both its arguments. Note that Equation 9.2-13 is not doubly
periodic since $R_{XX}(0, T) = 0 \neq R_{XX}(0, 0)$. Also note that the sample functions of $X(t)$ are
not periodic in any sense. 

The constant-value contours of the covariance function of a typical cyclostationary 
random process are shown in Figure 9.6-2. Note the difference between this configuration 
and that of a periodic random process, as shown in Figure 9.6-1. Effectively, cyclostationarity 
means that the statistics are periodic, but the process itself is not periodic. 

By averaging along 45° lines (i.e., $t_1 = t_2$), we can get the WSS versions of both types
of processes. The contours of constant density of the periodic process then become the
straight lines of the WSS periodic process shown in Figure 9.6-3. The WSS version of a
cyclostationary process just becomes an ordinary WSS process, because of the lack of any
periodic structure along 135° (anti-diagonal) lines (i.e., $t_1 = -t_2$).

In addition to modulators, scanning sensors tend to produce cyclostationary processes. 
For example, the line-by-line scanning in television transforms the random image field into 
a one-dimensional random process that has been modeled as cyclostationary. In communica- 
tions, cyclostationarity often arises due to waveform repetition at the baud or 
symbol rate. 
Figure 9.6-3 Possible contour plot of covariance function of WSS periodic random process. (Solid
lines are maxima; dashed lines are minima.)


A place where cyclostationarity arises in signal processing is when a stationary random 
sequence is analyzed by a filter bank and subsampled. The subsequent filter bank synthesis 
involves upsampling and reconstruction filters. If the subsampling period is N, then the 
resulting synthesized random sequence will be cyclostationary with period $N$. When perfect
reconstruction filters are used, then true stationarity will be achieved for the synthesized 
output. 
While cyclostationary processes are not stationary or WSS except in trivial cases, it is
sometimes appropriate to convert a cyclostationary process into a stationary process as in 
the following example. 


Example 9.6-1 
(WSS PSK) We have seen that the PSK process of Section 9.2 is cyclostationary and
hence not WSS. This is easily seen with reference to Equation 9.2-13. This cyclostationarity
arises from the fact that the analog angle process $\Theta_a(t)$ is stepwise constant and changes
only at $t = nT$ for integer $n$. In many real situations the modulation process starts at an
arbitrary time $t$, which in fact can be modeled as random from the viewpoint of the system
designer. Thus in this practical case, the modulated signal process (Equation 9.2-11) is
converted to
$$\tilde{X}(t) = \cos\left(2\pi f_c t + \Theta_a(t) + 2\pi f_c T_0\right), \tag{9.6-2}$$
by the addition of a random variable $T_0$, which is uniformly distributed on $[0, T]$ and
independent of the angle process $\Theta_a(t)$. It is then easy to see that the mean and covariance
functions need only to be modified by an ensemble average over $T_0$, which by the uniformity
of $T_0$ is just an integral over $[0, T]$. We thus obtain
$$R_{\tilde{X}\tilde{X}}(t_1 + \tau, t_1) = \frac{1}{T}\int_0^T R_{XX}(t_1 + \tau + t,\, t_1 + t)\, dt$$
$$= \frac{1}{T}\int_{-\infty}^{+\infty} s_Q(t + \tau)\, s_Q(t)\, dt$$
$$= \frac{1}{T}\, s_Q(\tau) * s_Q(-\tau), \tag{9.6-3}$$
which is just a function of the shift $\tau$. Thus $\tilde{X}(t)$ is a WSS random process.


Example 9.6-2 
(power spectral density of PSK) A WSS version of the random PSK signal was defined in
Example 9.6-1 through an averaging process, where the average was taken over the message
time or baud interval $T$. The resulting WSS random process $\tilde{X}(t)$ had correlation function
(Equation 9.6-3) given as
$$R_{\tilde{X}\tilde{X}}(\tau) = \frac{1}{T}\, s_Q(\tau) * s_Q(-\tau),$$
where $s_Q(\tau)$ was given as
$$s_Q(\tau) = \begin{cases} \sin(2\pi f_c \tau), & 0 \le \tau \le T, \\ 0, & \text{else.} \end{cases}$$

Figure 9.6-4 Power spectral density of PSK plotted for $f_c = 2.5$ and $T = 0.5$.

Then the psd of this WSS version of PSK can be calculated as
$$S_{\tilde{X}\tilde{X}}(\omega) \simeq \frac{T}{4}\left[\left(\frac{\sin\left[(\omega - 2\pi f_c)T/2\right]}{(\omega - 2\pi f_c)T/2}\right)^2 + \left(\frac{\sin\left[(\omega + 2\pi f_c)T/2\right]}{(\omega + 2\pi f_c)T/2}\right)^2\right] \quad \text{for } f_c T \gg 1, \tag{9.6-4}$$


which can be plotted† using MATLAB. The file psd_PSK.m is included on this book's Web site.

Some plots were made using psd_PSK.m, for two different sets of values for $f_c$ and $T$.
First we look at the psd plot in Figure 9.6-4 for $f_c = 2.5$ and $T = 0.5$, which gives
considerable overlap of the positive and negative frequency lobes of $S_{\tilde{X}\tilde{X}}(\omega)$. The lack of power
concentration at the carrier frequency $f_c$ is not surprising, since there is only a little over
one period of $s_Q(t)$ in the baud interval $T$. The next pair of plots shows a quite different case
with power strongly concentrated at $\omega_c$. These plots were computed with the values $f_c = 3.0$
and $T = 5.0$, thus giving 15 periods of the sine wave in the baud interval $T$. Figure 9.6-5 is
a linear plot, while Figure 9.6-6 shows $S_{\tilde{X}\tilde{X}}(\omega)$ on a logarithmic scale.
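
The comparison between the two parameter sets can also be sketched in Python. This is not the book's psd_PSK.m; it is an independent sketch that assumes a unit-amplitude pulse $s_Q(t) = \sin(2\pi f_c t)$ on $[0, T]$, takes the psd as $(1/T)|S_Q(\omega)|^2$ with $S_Q$ the Fourier transform of $s_Q$, and compares it with the two-lobe form with prefactor $T/4$. All function names are hypothetical.

```python
import numpy as np

def psd_psk_exact(w, fc, T, n=4001):
    """(1/T)|S_Q(w)|^2, with S_Q the Fourier transform of sin(2 pi fc t) on [0, T]."""
    w = np.atleast_1d(np.asarray(w, dtype=float))
    t = np.linspace(0.0, T, n)
    dt = t[1] - t[0]
    kern = np.sin(2.0 * np.pi * fc * t) * np.exp(-1j * np.outer(w, t))
    # Trapezoidal rule along the time axis.
    s_q = dt * (kern.sum(axis=1) - 0.5 * (kern[:, 0] + kern[:, -1]))
    return np.abs(s_q) ** 2 / T

def psd_psk_approx(w, fc, T):
    """Two-lobe approximation (cross-term between the lobes neglected)."""
    w = np.atleast_1d(np.asarray(w, dtype=float))
    wc = 2.0 * np.pi * fc
    # np.sinc(x) = sin(pi x)/(pi x), so sin(u)/u with u = dw*T/2 is np.sinc(u/pi).
    lobe = lambda dw: np.sinc(dw * T / (2.0 * np.pi)) ** 2
    return (T / 4.0) * (lobe(w - wc) + lobe(w + wc))

if __name__ == "__main__":
    w = np.linspace(-50.0, 50.0, 201)
    for fc, T in ((2.5, 0.5), (3.0, 5.0)):
        err = np.max(np.abs(psd_psk_exact(w, fc, T) - psd_psk_approx(w, fc, T)))
        print(f"fc={fc}, T={T}: max |exact - approx| = {err:.4f}")
```

Running the script prints the largest gap between the two curves for each parameter set, which gives a rough feel for how much the neglected cross-term matters when $f_c T$ is not large.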


†The reason for the approximate equals sign is that we have neglected the cross-term in Equation 9.6-4
between the two sinc terms at $\pm f_c$, as is appropriate for $f_c T \gg 1$.
Figure 9.6-5 Power spectral density of PSK plotted for $f_c = 3$ and $T = 5$.

Figure 9.6-6 Log of power spectral density of PSK plotted for $f_c = 3$ and $T = 5$. (Vertical axis:
$\log S_{\tilde{X}\tilde{X}}(\omega)$; horizontal axis: $\omega$ from $-50$ to $50$, with $\omega_c = 18.8496$.)
9.7 VECTOR PROCESSES AND STATE EQUATIONS


In this section we will generalize some of the results of Section 9.5 to the important class of 
vector random processes. This will lead into a brief discussion of state equations and vector 
Markov processes. Vector random processes occur in two-channel systems that are used in 
communications to model the in-phase and quadrature components of bandpass signals. 
Vector processes are also used extensively in control systems to model industrial processes 
with several inputs and outputs. Also, vector models are created artificially from high-order 
scalar models in order to employ the useful concept of state in both estimation and control 
theory. 

Let $X_1(t)$ and $X_2(t)$ be two jointly stationary random processes that are input to the
systems $H_1$ and $H_2$, respectively. Call the outputs $Y_1$ and $Y_2$, as shown in Figure 9.7-1.

Figure 9.7-1 A generic (uncoupled) two-channel LSI system.

From earlier discussions we know how to calculate $R_{X_1Y_1}$, $R_{X_2Y_2}$, $R_{Y_1Y_1}$, $R_{Y_2Y_2}$. We
now look at how to calculate the correlations across the systems, that is, $R_{X_1Y_2}$, $R_{X_2Y_1}$,
and $R_{Y_1Y_2}$. Given $R_{X_1X_2}$, we first calculate
$$R_{X_1Y_2}(\tau) = E[X_1(t + \tau) Y_2^*(t)]$$
$$= \int_{-\infty}^{+\infty} E[X_1(t + \tau) X_2^*(t - \beta)]\, h_2^*(\beta)\, d\beta$$
$$= \int_{-\infty}^{+\infty} R_{X_1X_2}(\tau + \beta)\, h_2^*(\beta)\, d\beta$$
$$= \int_{-\infty}^{+\infty} R_{X_1X_2}(\tau - \beta')\, h_2^*(-\beta')\, d\beta', \qquad (\beta' = -\beta),$$
so
$$R_{X_1Y_2}(\tau) = R_{X_1X_2}(\tau) * h_2^*(-\tau),$$
and by symmetry
$$R_{X_2Y_1}(\tau) = R_{X_2X_1}(\tau) * h_1^*(-\tau).$$
The cross-correlation at the outputs is
$$R_{Y_1Y_2}(\tau) = h_1(\tau) * R_{X_1X_2}(\tau) * h_2^*(-\tau).$$

Figure 9.7-2 General two-channel LSI system.

Expressing these results in the spectral domain, we have
$$S_{X_1Y_2}(\omega) = S_{X_1X_2}(\omega)\, H_2^*(\omega)$$
and
$$S_{Y_1Y_2}(\omega) = H_1(\omega)\, H_2^*(\omega)\, S_{X_1X_2}(\omega).$$


In passing, we note the following important fact: If the supports† of the two system functions
$H_1$ and $H_2$ do not overlap, then $Y_1$ and $Y_2$ are orthogonal random processes, irrespective of
any correlation in the input processes. We can generalize the above to a two-channel system
with internal coupling as seen in Figure 9.7-2. Here two additional system functions have
been added to cross-couple the inputs and outputs. They are denoted by $H_{12}$ and $H_{21}$.
This case is best treated with vector notation; thus we define
$$\mathbf{X}(t) \triangleq [X_1(t), X_2(t)]^T, \qquad \mathbf{Y}(t) \triangleq [Y_1(t), Y_2(t)]^T,$$
and
$$\mathbf{h}(t) \triangleq \begin{bmatrix} h_{11}(t) & h_{12}(t) \\ h_{21}(t) & h_{22}(t) \end{bmatrix},$$
where $h_{ij}(t)$ is the impulse response of the subsystem with frequency response $H_{ij}(\omega)$. We
then have
$$\mathbf{Y}(t) = \mathbf{h}(t) * \mathbf{X}(t), \tag{9.7-1}$$
where the vector convolution is defined by
$$(\mathbf{h}(t) * \mathbf{X}(t))_i \triangleq \sum_{j=1}^{N} h_{ij}(t) * X_j(t).$$

†We recall that the support of a function $g$ is defined as
$$\mathrm{supp}(g) \triangleq \{x \mid g(x) \neq 0\}.$$

If we define the following relevant input and output correlation matrices
$$\mathbf{R}_{XX}(\tau) \triangleq \begin{bmatrix} R_{X_1X_1}(\tau) & R_{X_1X_2}(\tau) \\ R_{X_2X_1}(\tau) & R_{X_2X_2}(\tau) \end{bmatrix} \tag{9.7-2}$$
$$\mathbf{R}_{YY}(\tau) \triangleq \begin{bmatrix} R_{Y_1Y_1}(\tau) & R_{Y_1Y_2}(\tau) \\ R_{Y_2Y_1}(\tau) & R_{Y_2Y_2}(\tau) \end{bmatrix}, \tag{9.7-3}$$
one can show that (Problem 9.44)
$$\mathbf{R}_{YY}(\tau) = \mathbf{h}(\tau) * \mathbf{R}_{XX}(\tau) * \mathbf{h}^\dagger(-\tau), \tag{9.7-4}$$
where the $\dagger$ indicates the Hermitian (or conjugate) transpose.
Taking the matrix Fourier transformation, we obtain
$$\mathbf{S}_{YY}(\omega) = \mathbf{H}(\omega)\, \mathbf{S}_{XX}(\omega)\, \mathbf{H}^\dagger(\omega) \tag{9.7-5}$$
with
$$\mathbf{H}(\omega) = FT\{\mathbf{h}(t)\},$$
and
$$\mathbf{S}(\omega) = FT\{\mathbf{R}(\tau)\},$$
where this notation is meant to imply an element-by-element Fourier transform. This
multichannel generalization clearly extends to the M input and N output case by just enlarging
the matrix dimensions accordingly.
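
A small numerical illustration of Equation 9.7-5 at a single frequency may help. The Python sketch below forms $\mathbf{S}_{YY} = \mathbf{H}\,\mathbf{S}_{XX}\,\mathbf{H}^\dagger$ for a hypothetical 2-by-2 system. With an uncoupled (diagonal) $\mathbf{H}$ whose second channel has no support at that frequency, the output cross-spectrum vanishes even though the inputs are correlated. All numerical values are made up for illustration.

```python
import numpy as np

def output_psd_matrix(H, Sxx):
    """S_YY = H S_XX H^dagger evaluated at one frequency (Equation 9.7-5)."""
    H = np.asarray(H, dtype=complex)
    Sxx = np.asarray(Sxx, dtype=complex)
    return H @ Sxx @ H.conj().T

if __name__ == "__main__":
    # Hypothetical input psd matrix at some frequency w: Hermitian, with correlated inputs.
    Sxx = np.array([[2.0, 0.5 + 0.5j],
                    [0.5 - 0.5j, 1.0]])
    # Uncoupled system; channel 2 is zero at this w (supports do not overlap here).
    H_uncoupled = np.diag([1.0 / (1.0 + 2.0j), 0.0])
    print(output_psd_matrix(H_uncoupled, Sxx))
    # Coupled system with cross terms H12 and H21 (again made-up values).
    H_coupled = np.array([[1.0, 0.3j],
                          [0.2, 1.0 - 1.0j]])
    print(output_psd_matrix(H_coupled, Sxx))
```

In the uncoupled case the (1,1) entry reduces to the scalar result $|H_1(\omega)|^2 S_{X_1X_1}(\omega)$, and in both cases the output matrix stays Hermitian.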


State Equations 


As shown in Problem 9.43, it is possible to rewrite an Nth-order LCCDE in the form of
a first-order vector differential equation where the dimension of the output vector is equal
to N,
$$\dot{\mathbf{Y}}(t) = \mathbf{A}\mathbf{Y}(t) + \mathbf{B}\mathbf{X}(t), \qquad -\infty < t < +\infty. \tag{9.7-6}$$
This is just a multichannel system as seen in Equation 9.7-1 and can be interpreted as a set
of N coupled first-order LCCDEs. We can take the vector Fourier transform and calculate
the system function
$$\mathbf{H}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\mathbf{B} \tag{9.7-7}$$
to specify this LSI operation in the frequency domain. Here $\mathbf{I}$ is the identity matrix.
Alternatively, we can express the operation in terms of a matrix convolution
$$\mathbf{Y}(t) = \mathbf{h}(t) * \mathbf{X}(t),$$
where we assume the multichannel system is stable; that is, all the impulse responses $h_{ij}$ are
BIBO stable. The solution proceeds much the same as in the scalar case for the first-order
equation; in fact, it can be shown that
$$\mathbf{h}(t) = \exp(\mathbf{A}t)\,\mathbf{B}\,u(t). \tag{9.7-8}$$
The matrix exponential function $\exp(\mathbf{A}t)$ was encountered earlier in this chapter in the
solution of the probability vector for a continuous-time Markov chain. This function is
widely used in linear system theory, where its properties have been studied extensively [9-3].
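
The relation between Equations 9.7-7 and 9.7-8 can be checked numerically. The Python sketch below assumes a diagonalizable, stable $\mathbf{A}$ and computes $\exp(\mathbf{A}t)$ by eigendecomposition (a simplification chosen to keep the example NumPy-only). As a test case it uses a phase-variable (companion) realization of the second-order LCCDE of Example 9.5-7; that particular realization is an assumption here, since the construction of Problem 9.43 is not reproduced.

```python
import numpy as np

def impulse_response(t, A, B):
    """h(t) = exp(A t) B u(t), with exp(A t) via eigendecomposition (A diagonalizable)."""
    if t < 0:
        return np.zeros(B.shape, dtype=complex)
    lam, V = np.linalg.eig(A)
    return (V * np.exp(lam * t)) @ np.linalg.inv(V) @ B

def H_state_space(w, A, B):
    """H(w) = (jwI - A)^{-1} B, Equation 9.7-7."""
    n = A.shape[0]
    return np.linalg.solve(1j * w * np.eye(n) - A, B.astype(complex))

def H_from_impulse_response(w, A, B, t_max=40.0, n=40_001):
    """Fourier transform of h(t) by the trapezoidal rule on [0, t_max]."""
    t = np.linspace(0.0, t_max, n)
    dt = t[1] - t[0]
    lam, V = np.linalg.eig(A)
    # Each eigen-mode contributes the integral of exp((lam_k - jw) t) over [0, t_max].
    modes = np.exp((lam[:, None] - 1j * w) * t[None, :])
    integ = dt * (modes.sum(axis=1) - 0.5 * (modes[:, 0] + modes[:, -1]))
    return (V * integ) @ np.linalg.inv(V) @ B

if __name__ == "__main__":
    # Companion form of Y'' + 3Y' + 2Y = 5X with state [Y, Y'] (assumed realization).
    A = np.array([[0.0, 1.0], [-2.0, -3.0]])
    B = np.array([[0.0], [5.0]])
    w = 1.3
    print(H_state_space(w, A, B).ravel())
    print(H_from_impulse_response(w, A, B).ravel())
    print(5.0 / ((1j * w) ** 2 + 3j * w + 2.0))  # scalar H(w) of Example 9.5-7
```

The first component of $\mathbf{H}(\omega)$ should match the scalar system function of Example 9.5-7, and the second component is $j\omega$ times the first, as expected for the derivative state.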
If we compute the cross-correlation matrices in the WSS case, we obtain
$$\mathbf{R}_{YX}(\tau) = \exp(\mathbf{A}\tau)\,\mathbf{B}\,u(\tau) * \mathbf{R}_{XX}(\tau)$$
and
$$\mathbf{R}_{XY}(\tau) = \mathbf{R}_{XX}(\tau) * \mathbf{B}^\dagger \exp(-\mathbf{A}^\dagger\tau)\,u(-\tau),$$
with output correlation matrix, as before,
$$\mathbf{R}_{YY}(\tau) = \mathbf{h}(\tau) * \mathbf{R}_{XX}(\tau) * \mathbf{h}^\dagger(-\tau). \tag{9.7-9}$$
Upon vector Fourier transformation, this becomes
$$\mathbf{S}_{YY}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\mathbf{B}\,\mathbf{S}_{XX}(\omega)\,\mathbf{B}^\dagger(-j\omega\mathbf{I} - \mathbf{A}^\dagger)^{-1}. \tag{9.7-10}$$

If $\mathbf{R}_{XX}(\tau) = \mathbf{Q}\delta(\tau)$, then since the system $\mathbf{H}$ is assumed causal, that is, $\mathbf{h}(t) = \mathbf{0}$ for $t < 0$,
we have that the cross-correlation matrix $\mathbf{R}_{YX}(\tau) = \mathbf{0}$ for $\tau < 0$; that is, $E[\mathbf{Y}(t + \tau)\mathbf{X}^\dagger(t)] =
\mathbf{0}$ for $\tau < 0$. In words we say that $\mathbf{Y}(t + \tau)$ is orthogonal to $\mathbf{X}(t)$ for $\tau < 0$. Thus, the
past of $\mathbf{Y}(t)$ is orthogonal to the present and future of $\mathbf{X}(t)$. If we additionally assume
that the input process $\mathbf{X}(t)$ is a Gaussian process, then the uncorrelatedness condition
becomes an independence condition. Under the Gaussian assumption then, the output $\mathbf{Y}(t)$
is independent of the present and future of $\mathbf{X}(t)$. A similar result was noted earlier in the
scalar-valued case. We can use this result to show that the solution to a first-order vector
LCCDE is a vector Markov random process with the following definition.


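Definition 9.7-1 (vector Markov) A random process $\mathbf{Y}(t)$ is vector Markov if for all
$n > 0$ and for all $t_n > t_{n-1} > \ldots > t_1$, and for all values $\mathbf{y}(t_{n-1}), \ldots, \mathbf{y}(t_1)$, we have
$$P[\mathbf{Y}(t_n) \le \mathbf{y}_n \mid \mathbf{y}(t_{n-1}), \ldots, \mathbf{y}(t_1)] = P[\mathbf{Y}(t_n) \le \mathbf{y}_n \mid \mathbf{y}(t_{n-1})]$$
for all values of the real vector $\mathbf{y}_n$. Here $\mathbf{A} \le \mathbf{a}$ means
$$(A_n \le a_n, A_{n-1} \le a_{n-1}, \ldots, A_1 \le a_1).$$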


Before discussing vector differential equations we briefly recall a result for deterministic
vector LCCDEs. The first-order vector equation,
$$\dot{\mathbf{y}}(t) = \mathbf{A}\mathbf{y}(t) + \mathbf{B}\mathbf{x}(t), \qquad t \ge t_0,$$
subject to the initial condition $\mathbf{y}(t_0)$, can be shown to have solution, employing the matrix
exponential
$$\mathbf{y}(t) = \exp[\mathbf{A}(t - t_0)]\,\mathbf{y}(t_0) + \int_{t_0}^{t} \mathbf{h}(t - v)\,\mathbf{x}(v)\, dv, \qquad t \ge t_0,$$
thus generalizing the scalar case. This deterministic solution can be found in any graduate
text on linear systems theory, for example, in [9-3]. The first term is called the zero-input
solution and the second term is called the zero-state (or driven) solution analogously to the
solution for scalar LCCDEs.
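
The zero-input plus zero-state decomposition can be verified numerically. The Python sketch below evaluates the solution formula with a trapezoidal approximation of the convolution integral and compares it with a direct fourth-order Runge-Kutta integration of the differential equation. It assumes a real, diagonalizable $\mathbf{A}$; the matrices, the input $x(v) = \cos v$, and the function names are illustrative choices only.

```python
import numpy as np

def solve_by_formula(A, B, x, y0, t0, t, n=4001):
    """y(t) = exp[A(t-t0)] y0 + integral_{t0}^{t} exp[A(t-v)] B x(v) dv."""
    lam, V = np.linalg.eig(A)
    Vinv = np.linalg.inv(V)
    expm = lambda s: (V * np.exp(lam * s)) @ Vinv       # exp(A s), A diagonalizable
    v = np.linspace(t0, t, n)
    dv = v[1] - v[0]
    vals = np.array([expm(t - vk) @ B @ x(vk) for vk in v])
    zero_state = dv * (vals.sum(axis=0) - 0.5 * (vals[0] + vals[-1]))
    zero_input = expm(t - t0) @ np.asarray(y0, dtype=float)
    return np.real(zero_input + zero_state)             # real for real A, B, x, y0

def solve_by_rk4(A, B, x, y0, t0, t, steps=4000):
    """Direct numerical integration of y' = A y + B x(t)."""
    h = (t - t0) / steps
    f = lambda s, y: A @ y + B @ x(s)
    y, s = np.array(y0, dtype=float), t0
    for _ in range(steps):
        k1 = f(s, y)
        k2 = f(s + h / 2, y + h / 2 * k1)
        k3 = f(s + h / 2, y + h / 2 * k2)
        k4 = f(s + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        s += h
    return y

if __name__ == "__main__":
    A = np.array([[0.0, 1.0], [-2.0, -3.0]])
    B = np.array([[0.0], [5.0]])
    x = lambda v: np.array([np.cos(v)])
    print(solve_by_formula(A, B, x, [1.0, 0.0], 0.0, 2.0))
    print(solve_by_rk4(A, B, x, [1.0, 0.0], 0.0, 2.0))
```

Setting the input to zero isolates the zero-input term, and setting the initial condition to zero isolates the zero-state term.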

We can extend this theory to the stochastic case by considering the differential
Equation 9.7-6 over the semi-infinite domain $t_0 \le t < \infty$ and replacing the above
deterministic solution with the following stochastic solution, expressed with the help of an integral:
$$\mathbf{Y}(t) = \exp[\mathbf{A}(t - t_0)]\,\mathbf{Y}(t_0) + \int_{t_0}^{t} \mathbf{h}(t - v)\,\mathbf{X}(v)\, dv. \tag{9.7-11}$$
If the LCCDE is BIBO stable, that is, the real parts of the eigenvalues of $\mathbf{A}$ are all
negative, in the limit as $t_0 \to -\infty$, we get the solution for all time, that is $t_0 = -\infty$,
$$\mathbf{Y}(t) = \int_{-\infty}^{t} \mathbf{h}(t - v)\,\mathbf{X}(v)\, dv = \mathbf{h}(t) * \mathbf{X}(t), \tag{9.7-12}$$
which is the same as already derived for the stationary infinite time-interval case. In effect,
we use the stability of the system to conclude that the resulting zero-input part of the
solution must be zero at any finite time.

The following theorem shows a method to generate a vector Gauss—Markov random 
process using the above approach. The input is now a white Gaussian vector process W(t) 
and the output vector Markov process is denoted by X(t). 


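Theorem 9.7-1 Let the input to the state equation
$$\dot{\mathbf{X}}(t) = \mathbf{A}\mathbf{X}(t) + \mathbf{B}\mathbf{W}(t)$$
be the white Gaussian process $\mathbf{W}(t)$. Then the output $\mathbf{X}(t)$ is a vector Gauss-Markov
random process.

Proof We write the solution at $t_n$ in terms of the solution at an earlier time $t_{n-1}$ as
$$\mathbf{X}(t_n) = \exp[\mathbf{A}(t_n - t_{n-1})]\,\mathbf{X}(t_{n-1}) + \int_{t_{n-1}}^{t_n} \mathbf{h}(t_n - v)\,\mathbf{W}(v)\, dv.$$
Then we write the integral term as $\mathbf{I}(t_n)$ and note that it is independent of $\mathbf{X}(t_{n-1})$. Thus
we can deduce that
$$P[\mathbf{X}(t_n) \le \mathbf{x}_n \mid \mathbf{x}(t_{n-1}), \ldots, \mathbf{x}(t_1)]$$
$$= P[\mathbf{I}(t_n) \le \mathbf{x}_n - e^{\mathbf{A}(t_n - t_{n-1})}\mathbf{x}(t_{n-1}) \mid \mathbf{x}(t_{n-1}), \ldots, \mathbf{x}(t_1)]$$
$$= P[\mathbf{I}(t_n) \le \mathbf{x}_n - e^{\mathbf{A}(t_n - t_{n-1})}\mathbf{x}(t_{n-1}) \mid \mathbf{x}(t_{n-1})]$$
and hence that $\mathbf{X}(t)$ is a vector Markov process. ∎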


If in Theorem 9.7-1 we did not have the Gaussian condition on the input W(t) but 
just the white noise condition, then we could not conclude that the output was Markov. 
This is because we would not have the independence condition required in the proof but 
only the weaker uncorrelatedness condition. On the other hand, if we relax the Gaussian 
condition but require that the input W(t) be an independent random process, then the 
process X(t) would still be Markov, but not Gauss-Markov. We use X for the process in 
this theorem rather than Y to highlight the fact that LCCDEs are often used to model 
input processes too. 
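
The recursion used in the proof also gives a practical way to simulate a Gauss-Markov process on a time grid. The Python sketch below treats the scalar case $\dot{X}(t) = -\alpha X(t) + W(t)$, assuming $W(t)$ has correlation $\delta(\tau)$, so that $X(t_n) = e^{-\alpha\Delta}X(t_{n-1}) + I(t_n)$ with $I(t_n)$ zero-mean Gaussian of variance $(1 - e^{-2\alpha\Delta})/(2\alpha)$ and independent of the past. The names, step size, and seed are illustrative assumptions.

```python
import numpy as np

def discretize_scalar(alpha, dt):
    """Exact sampling of X' = -alpha X + W (unit-intensity white Gaussian W)."""
    phi = np.exp(-alpha * dt)                                # state transition over one step
    q = (1.0 - np.exp(-2.0 * alpha * dt)) / (2.0 * alpha)    # variance of the integral term
    return phi, q

def simulate(alpha, dt, n, seed=0):
    """Generate n samples of the stationary scalar Gauss-Markov process."""
    rng = np.random.default_rng(seed)
    phi, q = discretize_scalar(alpha, dt)
    x = np.empty(n)
    x[0] = rng.normal(0.0, np.sqrt(1.0 / (2.0 * alpha)))     # start in the stationary law
    noise = rng.normal(0.0, np.sqrt(q), size=n - 1)          # independent integral terms
    for k in range(1, n):
        x[k] = phi * x[k - 1] + noise[k - 1]
    return x

if __name__ == "__main__":
    alpha, dt = 1.0, 0.1
    x = simulate(alpha, dt, 200_000)
    print("sample variance:", x.var(), " theory:", 1.0 / (2.0 * alpha))
```

The stationary variance $q/(1 - \phi^2) = 1/(2\alpha)$ agrees with $R_{YY}(0)$ from Example 9.5-6, which is a useful consistency check on the discretization.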


SUMMARY 


In this chapter we introduced the concept of the random process, an ensemble of functions 
of a continuous parameter. The parameter is most often time but can be position or another 
continuous variable. Most topics in this chapter generalize to two- and three-dimensional 
parameters. Many modern applications, in fact, require a two-dimensional parameter, for 
example, the intensity function i(t1,t2) of an image. Such random functions are called 
random fields and can be analyzed using extensions of the methods of this chapter. Random 
fields are discussed in Chapter 7 of [9-5] and in [9-8] among many other places. 

We introduced a number of important processes: asynchronous binary signaling; the 
Poisson counting process; the random telegraph signal; phase-shift keying, which is basic 
to digital communications; the Wiener process, our first example of a Gaussian random 
process and a basic building block process in nonlinear filter theory; and the Markov process, 
which is widely used for its efficiency and tractability and is the signal model in the widely 
employed Kalman-Bucy filter of Chapter 11. 

We considered the effect of linear systems on the second-order properties of random 
processes. We specialized our results to the useful subcategory of stationary and WSS 
processes and introduced the power spectral density and the corresponding analysis for LSI 
systems. We also briefly considered the classes of wide-sense periodic and cyclostationary 
processes and introduced random vector processes and systems and extended the Markov 
model to them. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 
9.1 Let $X[n]$ be a real-valued stationary random sequence with mean $E\{X[n]\} = \mu_X$
and autocorrelation function $E\{X[n + m]X[n]\} = R_{XX}[m]$. If $X[n]$ is the input to
a D/A converter, the continuous-time output can be idealized as the analog random
process $X_a(t)$ with
$$X_a(t) \triangleq X[n] \quad \text{for } n \le t < n + 1, \text{ for all } n,$$
as shown in Figure P9.1.

Figure P9.1 Typical output of sample-hold D/A converter.

(a) Find the mean $E[X_a(t)] = \mu_a(t)$ as a function of $\mu_X$.
(b) Find the correlation $E[X_a(t_1)X_a(t_2)] = R_{X_aX_a}(t_1, t_2)$ in terms of $R_{XX}[m]$.

9.2 Consider a WSS random sequence $X[n]$ with mean function $\mu_X$, a constant, and
correlation function $R_{XX}[m]$. Form a random process as
$$X(t) \triangleq \sum_{n=-\infty}^{+\infty} X[n]\, \frac{\sin \pi(t - n)}{\pi(t - n)}, \qquad -\infty < t < +\infty.$$
In what follows, we assume the infinite sums converge and so do not worry about
stochastic convergence issues.

(a) Find $\mu_X(t)$ in terms of $\mu_X$. Simplify your answer as much as possible.
(b) Find $R_{XX}(t_1, t_2)$ in terms of $R_{XX}[m]$. Is $X(t)$ WSS?

Hint: The sampling theorem from Linear Systems Theory states that any bandlimited
deterministic function $g(t)$ can be recovered exactly from its evenly spaced
samples, that is,
$$g(t) = \sum_{n=-\infty}^{+\infty} g(nT)\, \frac{\sin \pi(t - nT)/T}{\pi(t - nT)/T}$$
when the radian bandwidth of the function $g(t)$ is $\pi/T$ or less.

9.3 Let $B[n]$ be a Bernoulli random sequence equally likely taking on values $\pm 1$. Then
define the random process
$$X(t) \triangleq \sqrt{p}\, \sin\left(2\pi f_0 t + B[n]\frac{\pi}{2}\right) \quad \text{for } nT \le t < (n + 1)T, \text{ for all } n,$$
where $\sqrt{p}$ and $f_0$ are real numbers.

(a) Determine the mean function $\mu_X(t)$.
(b) Determine the covariance function $K_{XX}(t_1, t_2)$.

9.4 The output $Y(t)$ of a tapped delay-line filter shown in Figure P9.4, with input $X(t)$
and $N$ taps, is given by
$$Y(t) = \sum_{n=0}^{N-1} A_n X(t - nT).$$

Figure P9.4 Tapped delay-line filter.

The input $X(t)$ is a stationary Gaussian random process with zero mean and
autocorrelation function $R_{XX}(\tau)$ having the property that $R_{XX}(nT) = 0$ for every integer
$n \neq 0$. The tap gains $A_n$, $n = 0, 1, \ldots, N - 1$, are zero-mean, uncorrelated Gaussian
random variables with common variance $\sigma_A^2$. Every tap gain is independent of the
input process $X(t)$.

(a) Find the autocorrelation function of $Y(t)$.
(b) For a given value of $t$, find the characteristic function of $Y(t)$. Justify your
steps.
(c) For fixed $t$, what is the asymptotic pdf of $\frac{1}{\sqrt{N}}Y(t)$, asymptotic as $N \to \infty$?
Explain.
(d) Suppose now that the number of taps $N$ is a Poisson random variable with
mean $\lambda\,(> 0)$. Find the answers to parts (a) and (b) now.

(Note: You may need to use the following: $e^{-x} \approx \frac{1}{1+x}$ for $|x| \ll 1$, and
$e^{x} = \sum_{n=0}^{\infty} \frac{x^n}{n!}$.)

9.5 Let $N(t)$ be a Poisson random process defined on $0 \le t < \infty$ with $N(0) = 0$ and
mean arrival rate $\lambda > 0$.

(a) Find the joint probability $P[N(t_1) = n_1, N(t_2) = n_2]$ for $t_2 > t_1$.
(b) Find an expression for the Kth order joint PMF,
$$P_N(n_1, \ldots, n_K; t_1, \ldots, t_K),$$
with $0 \le t_1 < t_2 < \ldots < t_K < \infty$. Be careful to consider the relative values
of $n_1, \ldots, n_K$.




*9.6 The nonuniform Poisson counting process $N(t)$ is defined for $t \ge 0$ as follows:
(a) $N(0) = 0$.
(b) $N(t)$ has independent increments.
(c) For all $t_2 \ge t_1$,
$$P[N(t_2) - N(t_1) = n] = \frac{\left[\int_{t_1}^{t_2} \lambda(v)\, dv\right]^n}{n!} \exp\left(-\int_{t_1}^{t_2} \lambda(v)\, dv\right), \quad \text{for } n \ge 0.$$
The function $\lambda(t)$ is called the intensity function and is everywhere nonnegative,
that is, $\lambda(t) \ge 0$ for all $t$.

(a) Find the mean function $\mu_N(t)$ of the nonuniform Poisson process.
(b) Find the correlation function $R_{NN}(t_1, t_2)$ of $N(t)$. Define a warping of the
time axis as follows:
$$\tau(t) \triangleq \int_0^t \lambda(v)\, dv.$$
Now $\tau(t)$ is monotonic increasing if $\lambda(v) > 0$ for all $v$, so we can then define
the inverse mapping $t(\tau)$ as shown in Figure P9.6.

Figure P9.6 Plot of $\tau$ versus $t$.

(c) Assume $\lambda(t) > 0$ for all $t$ and define the counting process,
$$N_u(\tau) \triangleq N(t(\tau)).$$
Show that $N_u(\tau)$ is a uniform Poisson counting process with rate $\lambda = 1$; that
is, show for $\tau \ge 0$
(1) $N_u(0) = 0$.
(2) $N_u(\tau)$ has independent increments.
(3) For all $\tau_2 \ge \tau_1$,
$$P[N_u(\tau_2) - N_u(\tau_1) = n] = \frac{(\tau_2 - \tau_1)^n}{n!} e^{-(\tau_2 - \tau_1)}, \quad n \ge 0.$$

9.7 A nonuniform Poisson process $N(t)$ has intensity function (mean arrival rate)
$$\lambda(t) = 1 + 2t,$$
for $t \ge 0$. Initially $N(0) = 0$.

(a) Find the mean function $\mu_N(t)$.
(b) Find the correlation function $R_{NN}(t_1, t_2)$.
(c) Find an expression for the probability that $N(t) > t$, that is, find $P[N(t) > t]$
for any $t > 0$.
(d) Give an approximate answer for (c) in terms of the error function erf(x).


*9.8 This problem concerns the construction of the Poisson counting process as given in
Section 9.2.

(a) Show the density for the nth arrival time $T[n]$ is
$$f_T(t; n) = \frac{\lambda^n t^{n-1}}{(n-1)!} e^{-\lambda t} u(t), \qquad n > 0.$$
In the derivation of the property that the increments of a Poisson process are
Poisson distributed, that is,
$$P[X(t_a) - X(t_b) = n] = \frac{[\lambda(t_a - t_b)]^n}{n!} e^{-\lambda(t_a - t_b)} u[n], \qquad t_a > t_b,$$
we implicitly use the fact that the first interarrival time in $(t_b, t_a]$ is
exponentially distributed. Actually, this fact is not clear as the interarrival time in
question is only partially in the interval $(t_b, t_a]$. A pictorial diagram is shown
in Figure P9.8. Define $\tau'[i] \triangleq T[i] - t_b$ as the partial interarrival time. We
note $\tau'[i] = \tau[i] - T$, where the random variable $T \triangleq t_b - T[i - 1]$ and $\tau[i]$
denotes the (full) interarrival time.

Figure P9.8 Illustrative example of relation of arrival times to arbitrary observation interval.

(b) Fix the random variable $T = t$ and find the CDF
$$F_{\tau'[i]}(\tau' \mid T = t) = P\{\tau[i] \le \tau' + t \mid \tau[i] \ge t\}.$$
(c) Modify the result of part (b) to account for the fact that $T$ is a random
variable, and find the unconditional CDF of $\tau'$. (Hint: This part does not
involve a lot of calculations.)

Because of the preceding properties, the exponential distribution is called memoryless.
It is the only continuous distribution with this property.
9.9 Let $N(t)$ be a counting process on $[0, \infty)$ whose average rate $\lambda(t)$ depends on another
positive random process $S(t)$, specifically $\lambda(t) = S(t)$. We assume that $N(t)$ given
$\{S(t) \text{ on } [0, \infty)\}$ is a nonuniform Poisson process. We know $\mu_S(t) = \mu_0 > 0$ and also
know $K_{SS}(t_1, t_2)$.

(a) Find $\mu_N(t)$ for $t \ge 0$ in terms of $\mu_0$.
(b) Find $\sigma_N^2(t)$ for $t \ge 0$ in terms of $K_{SS}(t_1, t_2)$.

9.10 Let the random process $K(t)$ (not a covariance!) depend on a uniform Poisson process
$N(t)$, with mean arrival rate $\lambda > 0$, as follows: Starting at $t = 0$, both $N(t) = 0$ and
$K(t) = 0$. When an arrival occurs in $N(t)$, an independent Bernoulli trial takes place
with probability of success $p$, where $0 < p < 1$. On success, $K(t)$ is incremented by 1,
otherwise $K(t)$ is left unchanged. This arrangement is shown in Figure P9.11. Find
the first-order PMF of the discrete-valued random process $K(t)$ at time $t$, that is,
$P_K(k; t)$, for $t \ge 0$.

Figure P9.11 Poisson-modulated Bernoulli trial process. (Block diagram: a Poisson process
generator with output $N(t)$ feeding a Bernoulli trial generator with output $K(t)$.)

9.11 Let the scan-line of an image be described by the spatial random process $S(x)$, which
models the ideal gray level at the point $x$. Let us transmit each point independently
with an optical channel by modulating the intensity of a photon source:
$$\lambda(t, x) = S(x) + \lambda_0, \qquad 0 \le t \le T.$$
In this way we create a family of random processes, indexed by the continuous
parameter $x$,
$$\{N(t, x)\}.$$
For each $x$, $N(t, x)$ given $S(x)$ is a uniform Poisson process. At the end of the
observation interval, we store $N(x) \triangleq N(T, x)$ and inquire about the statistics of
this spatial process.

To summarize, $N(x)$ is an integer-valued spatial random process that depends on
the value of another random process $S(x)$, called the signal process. The spatial
random process $S(x)$ is stationary with zero mean and covariance function
$$K_{SS}(x) = \sigma_S^2 \exp(-\alpha|x|),$$
where $\alpha > 0$. The conditional distribution of $N(x)$, given $S(x) = s(x)$, is Poisson
with mean $\lambda(x) = (s(x) + \lambda_0)T$, where $\lambda_0$ is a positive constant; that is,
$$P[N(x) = n \mid S(x) = s(x)] = \frac{\lambda^n(x)}{n!} e^{-\lambda(x)} u[n].$$
The random variables $N(x)$ are conditionally independent from point to point.

(a) Find the (unconditional) mean and variance
$$\mu_N(x) = E[N(x)] \quad \text{and} \quad E[(N(x) - \mu_N(x))^2].$$
(Hint: First find the conditional mean and conditional mean square.)
(b) Find $R_{NN}(x_1, x_2) \triangleq E[N(x_1)N(x_2)]$.

9.12 Let $X(t)$ be a random telegraph signal (RTS) defined on $t \ge 0$. Fix $X(0) = +1$. The
RTS uses a Poisson random arrival time sequence $T[n]$ to switch the value of $X(t)$
between $\pm 1$. Take the average arrival rate as $\lambda\,(> 0)$. Thus we have
$$X(t) \triangleq \begin{cases} +1, & 0 \le t < T[1], \\ -1, & T[1] \le t < T[2], \\ +1, & T[2] \le t < T[3], \\ \;\vdots \end{cases}$$

(a) Argue that $X(t)$ is a Markov process and draw and label the state-transition
diagram.
(b) Find the steady-state probability that $X(t) = +1$, that is, $P_X(1; \infty)$, in terms
of the rate parameter $\lambda$.
(c) Write the differential equations for the state probabilities $P_X(1; t)$ and
$P_X(-1; t)$.

*9.13 A uniform Poisson process $N(t)$ with rate $\lambda\,(> 0)$ is an infinite-state Markov chain
with the state-transition diagram in Figure P9.13a. Here the state labels are the
values of the process (chain) $N(t)$ between the transitions. Also the independent
interarrival times $\tau[n]$ are exponentially distributed with parameter $\lambda$.

Figure P9.13a Poisson process represented as Markov chain.

We make the following modifications to the above scenario. Replace the independent
interarrival times $\tau[n]$ by an arbitrary nonnegative, stationary, and independent
random sequence, still denoted $\tau[n]$, resulting in the generalization called a renewal
process in the literature. See Figure P9.13b.

Figure P9.13b More general (renewal) process chain.

(a) Show that the PMF $P_N(n; t) = P[N(t) = n]$ of a renewal process is given, in
terms of the CDF of the arrival times $F_T(t; n)$, as
$$P_N(n; t) = F_T(t; n) - F_T(t; n + 1), \quad \text{when } n \ge 1,$$
where the arrival time $T[n] = \sum_{k=1}^{n} \tau[k]$ and $F_T(t; n)$ is the corresponding
CDF of the arrival time $T[n]$.
(b) Let $\tau[n]$ be $U[0, 1]$, that is, uniformly distributed over $[0, 1]$, and find $P_N(n; t)$
for $n = 0, 1$, and 2, for this specific renewal process.
(c) Find the characteristic function of the renewal process of part (b).
(d) Find an approximate expression for the CDF $F_T(t; n)$ of the renewal process
in part (b), that is good for large $n$, and not too far from the $T[n]$ mean
value. (Hint: For small $x$ we have the trigonometric series approximation
$\sin x \approx x - x^3/3!$.)

9.14 Let $W(t)$ be a standard Wiener process, defined over $[0, \infty)$ (i.e., distributed as
$N(0, t)$ at time $t$). Find the joint density $f_W(a_1, a_2; t_1, t_2)$ for $0 < t_1 < t_2$.

9.15 Let $W_1(t)$ and $W_2(t)$ be two Wiener processes, independent of one another, both
defined on $t \ge 0$, with variance parameters $\alpha_1$ and $\alpha_2$, respectively. Let the process
$X(t)$ be defined as their algebraic difference, that is, $X(t) \triangleq W_1(t) - W_2(t)$.

(a) What is $R_{XX}(t_1, t_2)$ for $t_1, t_2 \ge 0$?
(b) What is the pdf $f_X(x; t)$ for $t \ge 0$?

*9.16 Let the random process $X(t)$ have mean function $\mu_X(t) = 4$ and covariance function
$K_{XX}(t_1, t_2) = 5[\min(t_1, t_2)]^2$. Let the derivative process be denoted by $Y(t) = X'(t)$
for $t \ge 0$.

(a) Find the mean function of $Y(t)$.
(b) Find the correlation function $R_{YY}(t_1, t_2)$.
(c) Is the derivative process $Y(t)$ wide-sense stationary (WSS)?
(d) Show that the above $X(t)$ process actually exists by constructing it from
standard Wiener process(es).

*9.17 Let $W(t)$ be a standard Wiener process, that is, $\alpha = 1$, and define
$$X(t) \triangleq W(t) \quad \text{for } t \ge 0.$$

(a) Find the probability density $f_X(x; t)$.
(b) Find the conditional probability density $f_X(x_2 \mid x_1; t_2, t_1)$, $t_2 > t_1$.
(c) Is $X(t)$ Markov? Why?
(d) Does $X(t)$ have independent increments? Justify.
9.18 Let $X(t)$ be a Markov random process on $[0,\infty)$ with initial density $f_X(x;0) =
\delta(x - 1)$ and conditional pdf

$$f_X(x_2|x_1; t_2, t_1) = \frac{1}{\sqrt{2\pi(t_2 - t_1)}} \exp\left(-\frac{1}{2}\,\frac{(x_2 - x_1)^2}{t_2 - t_1}\right) \quad \text{for all } t_2 > t_1.$$

(a) Find $f_X(x;t)$ for all $t$.
(b) Repeat part (a) for $f_X(x;0) \sim N(0,1)$.


9.19 Consider the three-state Markov chain $N(t)$ with the state-transition diagram shown
in Figure P9.19. Here the state labels are the actual outputs, e.g., $N(t) = 3$, while
the chain is in state 3. The state transitions are governed by jointly independent,
exponentially distributed interarrival times, with average rates as indicated on the
branches.

(a) Given that we start in state 2 at time $t = 0$, what is the probability (conditional
probability) that we remain in this state until time $t$, for some arbitrary
$t > 0$? (Hint: There are two ways to leave state 2. So you will leave at the
lesser of the two independent exponential random variables with rates $\mu_2$
and $\lambda_2$.)

(b) Write the differential equations for the probability of being in state $i$ at time
$t \ge 0$, denoting them as $p_i(t)$, $i = 1,2,3$. [Hint: First write $p_i(t + \delta t)$ in terms
of the $p_i(t)$, $i = 1,2,3$, only keeping terms up to order $O(\delta t)$.]

(c) Find the steady-state solution for $p_i(t)$ for $i = 1,2,3$, that is, $p_i(\infty)$.

Figure P9.19 A three-state continuous-time Markov chain.


9.20 Let a certain wireless communication binary channel be in a good state or bad state,
described by the continuous-time Markov chain with transition rates as shown in
Figure P9.20. Here we are given that the exponentially distributed state transitions
have rates $\lambda_1 = 1$ and $\lambda_2 = 9$. The value of $\epsilon$ for each state is given in part (b)
below.

(a) Find the steady-state probability that the channel is in good state. Label
$P\{X(t) = \text{good}\} = p_g$, and $P\{X(t) = \text{bad}\} = p_b$. (Hint: Assume the
steady state exists and then write $p_g$ at time $t$ in terms of the two possibilities
at time $t - \delta$, keeping only terms to first order in $\delta$, taken as very small.)

(b) Assume that in the good state, there are no errors on the binary channel, but
in the bad state the probability of error is $\epsilon = 0.01$. Find the average error
probability on the channel. (Assume that the channel does not change state
during the transmission of each single bit.)
Figure P9.20 Model of two-state wireless communication channel.

9.21 This problem concerns the Chapman–Kolmogorov equation (cf. Equation 9.2-22) for
a continuous-amplitude Markov random process $X(t)$,

$$f_X(x(t_3)|x(t_1)) = \int_{-\infty}^{+\infty} f_X(x(t_3)|x(t_2))\, f_X(x(t_2)|x(t_1))\, dx(t_2),$$

for the conditional pdf at three increasing observation times $t_3 > t_2 > t_1 > 0$. You
will show that the pdf of the Wiener process with covariance function $K_{XX}(t,s) =
\alpha \min(t,s)$, $\alpha > 0$, solves the above equation.

(a) Write the first-order pdf $f_X(x(t))$ of this Wiener process for $t > 0$.

(b) Write the first-order conditional pdf $f_X(x(t)|x(s))$, $t > s > 0$.

(c) Referring back to the Chapman–Kolmogorov equation, set $t_3 - t_2 = t_2 - t_1 = \delta$
and use $x_3$, $x_2$, and $x_1$ to denote the values taken on. Then verify that your
conditional pdf from part (b) satisfies the resulting equation

$$f_X(x_3|x_1) = \int_{-\infty}^{+\infty} f_X(x_3|x_2)\, f_X(x_2|x_1)\, dx_2.$$

9.22 Is the random process X'(t) of Example 9.3-2 stationary? Why?

9.23 Let $A$ and $B$ be i.i.d. random variables with mean 0, variance $\sigma^2$, and third moment
$m_3 \triangleq E[A^3] = E[B^3] \ne 0$. Consider the random process

$$X(t) = A\cos(2\pi f t) + B\sin(2\pi f t), \quad -\infty < t < +\infty,$$

where $f$ is a given frequency.

(a) Show that the random process $X(t)$ is WSS.
(b) Show that $X(t)$ is not strictly stationary.

9.24 Earlier we proved Theorem 9.3-2, thus deriving Equation 9.3-5. State and prove a
corresponding theorem for covariance functions. Do not assume $\mu_X(t) = 0$.

9.25 Let $X(t)$ be a stationary random process with mean $\mu_X$ and covariance $K_{XX}(\tau) =
\delta(\tau)$. Let the sample functions of $X(t)$ drive the differential equation

$$\frac{dY(t)}{dt} + aY(t) = X(t), \quad a > 0,\ -\infty < t < +\infty.$$

(a) Find $\mu_Y(t) = E[Y(t)]$.
(b) Find $R_{YY}(\tau)$.
(c) Find $\sigma_Y^2(t)$.

9.26 Is the random process $X(t)$ generated by D/A conversion in Problem 9.1 WSS?
9.27 The psd of a random process is given as $S_{XX}(\omega) = 1/(\omega^2 + 9)$ for $-\infty < \omega < +\infty$.
Find its autocorrelation function $R_{XX}(\tau)$.

9.28 Consider the LSI system shown in Figure P9.28, whose input is the zero-mean
random process $W(t)$ and whose output is the random process $X(t)$. The frequency
response of the system is $H(\omega)$. Given $K_{WW}(\tau) = \delta(\tau)$, find $H(\omega)$ in terms of the
cross-covariance $K_{XW}(\tau)$ or its Fourier transform.

[Block diagram: W(t) → H(ω) → X(t)]

Figure P9.28 LSI system with white noise input.


9.29 Let a random process $Y(t)$ be given as

$$Y(t) = X(t) + 0.3\,\frac{dX(t)}{dt}, \quad -\infty < t < +\infty,$$

where $X(t)$ is a random process with mean function $\mu_X(t) = 5t$, and covariance
function

$$K_{XX}(t_1, t_2) = \frac{\sigma^2}{1 + \alpha(t_1 - t_2)^2}, \quad \alpha > 0.$$

(a) Find the mean function $\mu_Y(t)$.
(b) Find the covariance function $K_{YY}(t_1, t_2)$.
(c) Is the random process $Y(t)$ WSS? Why?


9.30 Consider the first-order stochastic differential equation

$$\frac{dX(t)}{dt} + X(t) = W(t)$$

driven by the zero-mean white noise $W(t)$ with correlation function $R_{WW}(t,s) =
\delta(t - s)$.

(a) If this differential equation is valid for all time, $-\infty < t < +\infty$, find the psd
of the resulting wide-sense stationary process $X(t)$.

(b) Using residue theory (or any other method), find the inverse Fourier transform
of $S_{XX}(\omega)$, the autocorrelation function $R_{XX}(\tau)$, $-\infty < \tau < +\infty$.

(c) If the above differential equation is run only for $t > 0$, is it possible to
choose an initial condition random variable $X(0)$ such that $X(t)$ is wide-sense
stationary for all $t > 0$? If such a random variable exists, find its mean
and variance. You may assume that the random variable $X(0)$ is orthogonal
to $W(t)$ on $t \ge 0$; that is, $X(0) \perp W(t)$. [Hint: Express $X(t)$ for $t > 0$ in
terms of the initial condition and a stochastic integral involving $W(t)$.]
9.31 Show that $h(\tau) * h^*(-\tau)$ is a positive semidefinite function by working directly with
the definition and exclusively in the time domain. Assume that the function $h(t)$ is
square integrable, that is, $\int_{-\infty}^{+\infty} |h(t)|^2\,dt < \infty$.

9.32 Let the random process $X(t)$ with mean value 128 and covariance function

$$K_{XX}(\tau) = 1000 \exp(-10|\tau|)$$

be filtered by the lowpass filter


to produce the output process $Y(t)$.
(a) Find the mean function $\mu_Y(t)$.
(b) Find the covariance $K_{YY}(\tau)$.

9.33 Consider the continuous-time system with input random process $X(t)$ and output
process $Y(t)$:

$$Y(t) = \frac{1}{4}\int_{-2}^{+2} X(t - s)\,ds.$$

Assume that the input $X(t)$ is WSS with psd $S_{XX}(\omega) = 2$ for $-\infty < \omega < +\infty$.
(a) Find the psd of the output $S_{YY}(\omega)$.
(b) Find $R_{YY}(\tau)$, the correlation function of the output.

9.34 A WSS and zero-mean random process $Y(t)$ has sample functions consisting of
successive rectangular pulses of random amplitude and duration as shown in Figure
P9.34.

Figure P9.34 Random amplitude pulse train.

The pdf for the pulse width is

$$f_W(w) = \begin{cases} \lambda e^{-\lambda w}, & w \ge 0, \\ 0, & w < 0, \end{cases}$$

with $\lambda > 0$. The amplitude of each pulse is a random variable $X$ (independent
of $W$) with mean 0 and variance $\sigma_X^2$. Successive amplitudes and pulse widths are
independent.
(a) Find the autocorrelation function $R_{YY}(\tau) = E[Y(t+\tau)Y(t)]$.
(b) Find the corresponding psd $S_{YY}(\omega)$.
[Hint: First find the conditional autocorrelation function $E[Y(t+\tau)Y(t)\,|\,W = w]$,
where $t$ is assumed to be at the start of a pulse (do this without loss of generality
per WSS hypothesis for $Y(t)$).]

9.35 Let $X(t)$ be a WSS random process with mean $\mu_X$ and covariance function $K_{XX}(\tau) =
5\cos(10\tau)e^{-2|\tau|}$. The process $X(t)$ is now input to the linear system with system


function ne 
s 

H SS —O————— 

(5) = Sy ats 430" 


yielding the output process $Y(t)$.

(a) First, find the input psd $S_{XX}(\omega)$. Sketch your answer.

(b) Write an expression for the average power of $X(t)$ in the frequency range
$\omega_1 \le |\omega| \le \omega_2$. You may leave the expression in integral form.

(c) Find the output psd $S_{YY}(\omega)$. Sketch your answer.

*9.36 In this problem we consider using white noise as an approximation to a smoother
process (cf. More on White Noise in Section 9.5), which is the input to a lowpass
filter. The output process from the filter is then investigated to determine the error
resulting from the white noise approximation. Let the stationary random process
$X(t)$ have zero mean and autocovariance function

$$K_{XX}(\tau) = \frac{1}{2\tau_0}\exp(-|\tau|/\tau_0),$$

which can be written as $h(\tau) * h(-\tau)$ with $h(\tau) = \frac{1}{\tau_0}e^{-\tau/\tau_0}u(\tau)$.

[Block diagram: X(t) → G(ω) → Y(t)]

Figure P9.36a Approximation to white noise input to filter.

(a) Let $X(t)$ be input to the lowpass filter shown in Figure P9.36a, with output
$Y(t)$. Find the output psd $S_{YY}(\omega)$, for

$$G(\omega) \triangleq \begin{cases} 1, & |\omega| \le \omega_0, \\ 0, & \text{else.} \end{cases}$$

[Block diagram: W(t) → G(ω) → V(t)]

Figure P9.36b White noise input to filter.

(b) Alternatively we may, at least formally, excite the system directly with a
standard white noise $W(t)$, with mean zero and $K_{WW}(\tau) = \delta(\tau)$. Call the
output $V(t)$ as shown in Figure P9.36b. Find the output psd $S_{VV}(\omega)$.
(c) Show that for $|\omega_0\tau_0| \ll 1$, $S_{YY} \approx S_{VV}$ and find an upper bound on the
power error

$$|R_{VV}(0) - R_{YY}(0)|.$$


9.37 Consider the LSI system shown in Figure P9.37. Let $X(t)$ and $N(t)$ be WSS and
mutually uncorrelated with power spectral densities $S_{XX}(\omega)$ and $S_{NN}(\omega)$ and zero
means.

Figure P9.37

(a) Find the psd of the output $Y(t)$.
(b) Find the cross-power spectral density of $X$ and $Y$, that is, find $S_{XY}(\omega)$ and
$S_{YX}(\omega)$.
(c) Define the error $\epsilon(t) \triangleq Y(t) - X(t)$ and evaluate the psd of $\epsilon(t)$.
(d) Assume that $h(t) = a\delta(t)$ and choose the value of $a$ which minimizes $E[\epsilon^2(t)] =
R_{\epsilon\epsilon}(0)$.
9.38 Let $X(t)$ be a random process defined by

$$X(t) \triangleq N\cos(2\pi f_0 t + \Theta),$$

where $f_0$ is a known frequency and $N$ and $\Theta$ are independent random variables. The
CF for $N$ is

$$\Phi_N(\omega) = E[e^{j\omega N}] = \exp\{\lambda[e^{j\omega} - 1]\},$$

where $\lambda$ is a given positive constant (i.e., $N$ is a Poisson random variable). The
random variable $\Theta$ is uniformly distributed on $[-\pi, +\pi]$.

(a) Determine the mean function $\mu_X(t)$.

(b) Determine the covariance function $K_{XX}(t, s)$.

(c) Is $X(t)$ WSS? Justify your answer.

(d) Is $X(t)$ stationary? Justify your answer.


9.39 Let $X(t)$ be an independent-increment random process defined on $t \ge 0$ with initial
value $X(0) = X_0$, a random variable. Assume the following CFs exist: $E[e^{j\omega X_0}] \triangleq
\Phi_{X_0}(\omega)$ and

$$E[e^{j\omega(X(t) - X(s))}] \triangleq \Phi_{X(t)-X(s)}(\omega) \quad \text{for } t > s.$$

(a) On defining $E[e^{j\omega X(t)}] \triangleq \Phi_{X(t)}(\omega)$, show that

$$\Phi_{X(t)}(\omega) = \Phi_{X_0}(\omega)\,\Phi_{X(t)-X_0}(\omega).$$

(b) Show that for all $t_2 > t_1$, the joint characteristic function of $X(t_2)$ and $X(t_1)$
is given by

$$\Phi_{X(t_2),X(t_1)}(\omega_2,\omega_1) = \Phi_{X_0}(\omega_1 + \omega_2)\,\Phi_{X(t_1)-X_0}(\omega_1 + \omega_2)\,\Phi_{X(t_2)-X(t_1)}(\omega_2).$$

(c) Apply part (a) to Problem 9.18(b) by using Gaussian characteristic functions.
9.40 Express the answers to the following questions in terms of pdf’s. 


(a) State the definition of an independent-increments random process. 

(b) State the definition of a Markov random process. 

(c) Prove that any random process that has independent increments also has the 
Markov property. 


9.41 Let $X(t)$ defined over $t \ge 0$ have independent increments with mean function
$\mu_X(t) = \mu_0$ and covariance function

$$K_{XX}(t_1, t_2) = \sigma_X^2(\min(t_1, t_2)),$$

where $\sigma_X^2(t)$ is an increasing function, that is, $d\sigma_X^2(t)/dt > 0$ for all $t \ge 0$, called the
variance function. Note that $\mathrm{Var}[X(t)] = \sigma_X^2(t)$. Fix $T > 0$ and find the mean and
covariance functions of $Y(t) \triangleq X(t) - X(T)$ for all $t \ge T$. (Note: For the covariance
function take $t_1$ and $t_2 \ge T$.)

*9.42 Following Example 9.2-3, use MATLAB to compute a 1000-element sample function
of the Wiener process $X(t)$ for $\alpha = 2$ and $T = 0.01$.

(a) Use the MATLAB routine hist.m to compute the histogram of $X(10)$ and
compare it with the ideal Gaussian pdf.

(b) Estimate the mean of $X(10)$ using mean.m and the standard deviation using
std.m and compare them to theoretical values. [Hint: Use Wiener.m† in a
for loop to calculate 100 realizations of x(1000). Then use hist. Question:
Why can't you just use the last 100 elements of the vector x to approximately
obtain the requested statistics?]


9.43 Let the WSS random process $X(t)$ be the input to the third-order differential
equation

$$\frac{d^3Y}{dt^3} + a_2\frac{d^2Y}{dt^2} + a_1\frac{dY}{dt} + a_0 Y(t) = X(t),$$

with WSS output random process $Y(t)$.

(a) Put this equation into the form of a first-order vector differential equation

$$\frac{d\mathbf{Y}}{dt} = \mathbf{A}\mathbf{Y}(t) + \mathbf{B}\mathbf{X}(t),$$

by defining $\mathbf{Y}(t) \triangleq [Y(t),\ Y'(t),\ Y''(t)]^T$ and $\mathbf{X}(t) \triangleq [X(t)]$ and evaluating the matrices
$\mathbf{A}$ and $\mathbf{B}$.


†Wiener.m is provided on this book's Web site.
(b) Find a first-order matrix-differential equation for $\mathbf{R}_{XY}(\tau)$ with input
$\mathbf{R}_{XX}(\tau)$.

(c) Find a first-order matrix-differential equation for $\mathbf{R}_{YY}(\tau)$ with input
$\mathbf{R}_{XY}(\tau)$.

(d) Using matrix Fourier transforms, show that the output psd matrix $\mathbf{S}_{YY}$ is
given as

$$\mathbf{S}_{YY}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\,\mathbf{B}\,\mathbf{S}_{XX}(\omega)\,\mathbf{B}^{\dagger}\,(-j\omega\mathbf{I} - \mathbf{A}^{\dagger})^{-1}.$$


9.44 Let X(t) be a WSS vector random process, which is input to the LSI system with 
impulse response matrix h(t). 


(a) Show that the correlation matrix of the output Y(t) is given by 
Equation 9.7-4. 
(b) Derive the corresponding equation for matrix covariance functions. 


9.45 In geophysical signal processing one often has to simulate a multichannel random 
process. The following problem brings out an important constraint on the power 
spectral density matrix of such a vector random process. Let the N-dimensional 
vector random process X(t) be WSS with correlation matrix 


$$\mathbf{R}_{XX}(\tau) \triangleq E[\mathbf{X}(t+\tau)\mathbf{X}^{\dagger}(t)]$$

and power spectral density matrix

$$\mathbf{S}_{XX}(\omega) \triangleq FT\{\mathbf{R}_{XX}(\tau)\}.$$

Here $FT\{\cdot\}$ denotes the matrix Fourier transform, that is, the $(i,j)$th component of
$\mathbf{S}_{XX}$ is the Fourier transform of the $(i,j)$th component of $\mathbf{R}_{XX}$, which is
$E[X_i(t+\tau)X_j^*(t)]$, where $X_i(t)$ is the $i$th component of $\mathbf{X}(t)$.

(a) For constants $a_1, \ldots, a_N$ define the WSS scalar process

$$Y(t) \triangleq \sum_{i=1}^{N} a_i X_i(t).$$

Find the power spectral density of $Y(t)$ in terms of the components of the
matrix $\mathbf{S}_{XX}(\omega)$.

(b) Show that the psd matrix $\mathbf{S}_{XX}(\omega)$ must be a positive semidefinite matrix for
each fixed $\omega$; that is, we must have $\mathbf{a}^T\mathbf{S}_{XX}(\omega)\mathbf{a}^* \ge 0$ for all complex column
vectors $\mathbf{a}$.


9.46 Consider the linear system shown in Figure P9.46 excited by the two orthogonal,
zero-mean, jointly WSS random processes $X(t)$, "the signal," and $U(t)$, "the noise."
Then the input to the system $G$ is

$$Y(t) = h(t) * X(t) + U(t),$$

which models a distorted-signal-in-noise estimation problem. If we pass this $Y(t)$,
"the received signal," through the filter $G$, we get an estimate $\hat{X}(t)$. Finally $\epsilon(t)$ can
be thought of as the "estimation error."

Figure P9.46 System for evaluating estimation error.

In this problem we will calculate some relevant power spectral densities and cross-power
spectral density.

(a) Find $S_{YY}(\omega)$.

(b) Find $S_{X\hat{X}}(\omega) = S^*_{\hat{X}X}(\omega)$, in terms of $H$, $G$, $S_{XX}$, and $S_{UU}$.

(c) Find $S_{\epsilon\epsilon}(\omega)$.

(d) Use your answer to part (c) to show that to minimize $S_{\epsilon\epsilon}(\omega)$ at those frequencies
where

$$S_{XX}(\omega) \gg S_{UU}(\omega),$$

we should have $G \approx H^{-1}$ and where

$$S_{XX}(\omega) \ll S_{UU}(\omega)$$

we should have $G \approx 0$.


*9.47 Let $X(t)$, the input to the system in Figure P9.47, be a stationary Gaussian random
process. The power spectral density of $Z(t)$ is measured experimentally and found
to be

$$S_{ZZ}(\omega) = \pi\delta(\omega) + \frac{2\beta}{(\omega^2 + \beta^2)(\omega^2 + 1)}.$$

[Block diagram: X(t) → squarer → Y(t) = X²(t) → filter h(t) = e^{-t}u(t) → Z(t)]

Figure P9.47 Squarer nonlinearity followed by linear filter.

(a) Find the correlation function of $Y(t)$ in terms of $\beta$.
(b) Find the correlation function of $X(t)$.
9.48 Consider the two-state Markov chain $N(t)$ shown in Figure P9.48, taking on values
1 and 2. While in state 1, the transition time to state 2 has average rate $\lambda_1 = 1$.
In state 2, the transition time to state 1 has average rate $\lambda_2 = 2$. Denote the state
probabilities as $P_1(t)$ and $P_2(t)$, where $P_i(t) \triangleq P[N(t) = i]$ for $i = 1,2$.

Figure P9.48 Two-state Markov chain state-transition diagram.

(a) Derive the differential equations for the $P_i(t)$.
(b) Find their steady-state solution.


9.49 The Schwarz inequality for complex-valued random variables states that 


$$|E[XY^*]| \le \sqrt{E[|X|^2]\,E[|Y|^2]}$$

for two random variables $X$ and $Y$.

(a) Use the Schwarz inequality to derive the corresponding result for WSS random
processes $X(t)$ and $Y(t)$,

$$|R_{XY}(\tau)| \le \sqrt{R_{XX}(0)\,R_{YY}(0)}.$$

(b) Find the corresponding result for cross-power spectral densities,

$$|S_{XY}(\omega)| \le \sqrt{S_{XX}(\omega)\,S_{YY}(\omega)}.$$

Hint: Interpret the result of part (a) in terms of cross- and auto-power
spectra, and then introduce a narrow bandpass filter centered at an arbitrary
frequency $\omega$.
9.50 The Wiener process, also called Brownian motion, is the integral of white noise. 
Letting B(t) denote the Wiener process, with W(t) denoting the white noise, we can 
write 


$$B(t) = \int_0^t W(\tau)\,d\tau, \quad t \ge 0.$$

Take $W(t)$ to be a standard white noise with correlation function $R_{WW}(\tau) = \delta(\tau)$.

(a) Find and sketch the cross-correlation function $R_{BW}(t_1, t_2)$.
(b) Find and sketch the autocorrelation function $R_{BB}(t_1, t_2)$.


9.51 Consider the two-processor reliability problem of Example 9.2-4 in the text, a three- 
state continuous-time Markov random process X(t) with state-transition diagram 
shown in Figure P9.51. Here, X(t) denotes the number of processors “up” at time t. 


(a) Write the state probability vector $\mathbf{p}(t)$ differential equation

$$d\mathbf{p}(t)/dt = \mathbf{A}\mathbf{p}(t)$$

and explicitly find the generator matrix $\mathbf{A}$.

(b) We determine the steady-state probability vector $\mathbf{p}$ by solving the homogeneous
matrix-vector equation $\mathbf{A}\mathbf{p} = \mathbf{0}$, subject to the constraint that all the probabilities
in the probability vector $\mathbf{p}$ sum to 1. Someone claims that the "probability
flows" across the dashed vertical lines in Figure P9.51 must balance in
the steady state, that is, $2\mu p_0 = \lambda p_1$ and $\mu p_1 = 2\lambda p_2$, where the $p_i$ are the
elements of the vector $\mathbf{p}$, that is, the steady-state probabilities of being in state
$i$, $i = 0,1,2$. State why this is a reasonable assertion, and prove it by showing
that the resulting equations satisfy $\mathbf{A}\mathbf{p} = \mathbf{0}$.

Figure P9.51

(c) Solve for the numerical steady-state probability values in the case when $\lambda = 0.001$
and $\mu = 0.1$ per hour.


9.52 Consider the three-input, two-output LSI system shown in Figure P9.52. The input
random processes $X_1(t)$, $X_2(t)$, and $U(t)$ are jointly WSS and pairwise orthogonal,
that is, $X_1 \perp X_2$, $X_1 \perp U$, and $X_2 \perp U$. We are given the following functions:
the indicated system functions $H$, $G$, and $B$, plus the three input power spectral
densities $S_{X_1X_1}$, $S_{X_2X_2}$, and $S_{UU}$. You may express your answers in terms of these
functions.

Figure P9.52
(a) Find the input/output cross-power spectral density Sy, x, (w).
(b) Find the input/output cross-power spectral density Sy, x,(w).
(c) Find the output cross-power spectral density $S_{Y_1Y_2}(\omega)$.


9.53 Consider the following tapped delay-line problem. We have a random sequence $A_n$
for the taps and a WSS random process $X(t)$ as the signal model. Assume the
total number of taps is $N$ and the tap spacing is $T$. Assume also that the random
sequence $A_n$ and the random process $X(t)$ are jointly independent. The tapped
delay-line output is therefore

$$Y(t) = \sum_{n=0}^{N-1} A_n X(t - nT).$$

The correlation function for the random sequence of tap weights is given as $R_A(n_1, n_2)$,
and the correlation function of the WSS random process is given as $R_X(\tau)$.

(a) Find the output correlation function $R_Y(t_1, t_2)$ in terms of the given functions
and parameters.

(b) Does the wide-sense stationarity of $Y(t)$ depend on whether the random
sequence $A_n$ is WSS? Justify your answer.

(c) In finding your result in part (a), is it sufficient that $A_n$ and $X(t)$ be uncorrelated?
Why?


9.54 Let a certain wireless packet channel (Gilbert channel model) having a good state 
and a bad state be modeled as a continuous-time, two-state Markov chain with 
transition rates as given in Figure P9.54. 


Figure P9.54 Gilbert channel model.

(a) Find the steady-state probability of being in the bad state.
(b) In the good state, all packets are received. In the bad state, all packets are
lost. This leads to bursts or clusters of lost packets. In a packet-loss burst, all
packets are lost. What is the average length of a packet-loss burst? Justify.
Note that the chain is in the bad state for the full duration of a packet-loss
burst.


9.55 Consider a Poisson random process $N(t)$ with average arrival rate $\lambda = 3$.


(a) Find the probability that N(4) = 2. 
(b) Find the joint probability that N(1) = 1 and N(2) = 2. 


9.56 Consider the system shown in Figure P9.56. 


[Block diagram: X[n] and V[n] are summed, and the sum drives H(ω) to produce Y[n]]

Figure P9.56

Let $X[n]$ and $V[n]$ be WSS and mutually uncorrelated with zero means and power
spectral densities $S_{XX}(\omega)$ and $S_{VV}(\omega)$, respectively.

(a) Find the psd of the output $Y[n]$.
(b) Find the cross-power spectral density between input $X[n]$ and output $Y[n]$,
that is, $S_{XY}(\omega)$.


9.57 Let $X(t)$ and $Y(t)$ be two zero-mean random processes with known correlation coefficient
function

$$\rho_{XY}(t_1, t_2) \triangleq \frac{E[X(t_1)Y^*(t_2)]}{\sqrt{E[|X(t_1)|^2]\,E[|Y(t_2)|^2]}},$$

and assume that the average powers $E[|X(t)|^2] = E[|Y(t)|^2] \triangleq P$, a constant. Next,
add two random noises $U(t)$ and $V(t)$, jointly orthogonal to $X(t)$ and $Y(t)$,

$$\tilde{X}(t) \triangleq X(t) + U(t),$$
$$\tilde{Y}(t) \triangleq Y(t) + V(t),$$

where $U$ and $V$ are also orthogonal to each other and of zero mean, and with average
powers $E[|U(t)|^2] = E[|V(t)|^2] \triangleq \epsilon$, a constant. Find the correlation coefficient
function of the tilde processes, that is, $\rho_{\tilde{X}\tilde{Y}}(t_1, t_2)$ in terms of that of the original
processes $X$ and $Y$.

9.58 Consider the system shown in Figure P9.58. The two-input random sequences are 
WSS and given in terms of their power spectral densities: 


Qw? + 8 
(w2 + 3) (w2 +5) © 


1 
Sxx(w) = as and Svvy(w) = 
Figure P9.58 System with signal plus noise input.

The system function $H(\omega)$ is given as $10[u(\omega + \pi/2) - u(\omega - \pi/2)]$ over the interval
$[-\pi, +\pi]$, where the function $u$ is the unit step. Assume that $X$ and $V$ are zero
mean.

(a) Assuming $X$ and $V$ are uncorrelated, find the psd of the output random
sequence $Y[n]$.

(b) Let the cross-power spectral density of $X$ and $V$ be specified as

$$S_{XV}(\omega) = \frac{1}{\omega^2 + 5},$$

and find the new output power spectral density of $Y$.

9.59 Consider the random process $X(t) = \cos(\omega_0 t + \Theta)$, where $\Theta$ is a random variable
uniformly distributed over the interval $[0, 2\pi]$, and $\omega_0$ is a fixed frequency. Find
the first-order pdf $f_X(x;t)$. Is the process stationary of first order? Find the
conditional pdf of $X(t_2)$ given $X(t_1) = x_1$, which we denote by $f_X(x_2|x_1; t_2, t_1)$.
You may assume $t_1 < t_2$.

*9.60 Let $Z(t) = X(t) + jY(t)$, where $X(t)$ and $Y(t)$ are jointly WSS and real-valued
random processes. Assume that $X(t)$ and $Y(t)$ are mutually orthogonal with zero-mean
functions. Define a new random process in terms of a modulation to a carrier
frequency $\omega_0$ as $U(t) = \mathrm{Re}\{Z(t)e^{-j\omega_0 t}\}$. Given the relevant correlation functions,
that is, $R_{XX}(\tau)$ and $R_{YY}(\tau)$, find general conditions on them such that $U(t)$ is also
a WSS random process. Show that your conditions work, that is, that the resulting
process $U(t)$ is actually WSS. Some helpful trigonometric identities:

$$\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$

$$\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta.$$

9.61 Find the steady-state probabilities of the four-state Markov chain shown in Figure
P9.61. Express your answers in terms of the exponential rates $\lambda_i$ and $\mu_i$. Note the
state labels are conveniently given as 1 through 4.

Figure P9.61

Hint: Remember the probability flow concept from Problem 9.51(b).
9.62 Consider the three-state Markov process $X(t)$ with state-transition diagram shown
in Figure P9.62. Here the state labels are the actual outputs, that is, $X(t) = 3$ all the
while the process is in state 3. The state transitions are governed by jointly independent,
exponentially distributed interarrival times, with average rates as indicated
on the branches.

Figure P9.62 Three-state Markov process.


(a) Given that we start at state 2 at time $t = 0$, what is the probability that we
leave this state for the first time at time $t$, for some arbitrary $t > 0$?
(b) Find the vector differential equation for the state probability at time $t \ge 0$,

$$\frac{d\mathbf{p}}{dt} = \mathbf{A}\mathbf{p}(t),$$

where $\mathbf{p}(t) = [p_1, p_2, p_3]^T$, expressing the generator matrix $\mathbf{A}$ in terms of the
$\mu_i$ and $\lambda_i$.

(c) Show that the solution for $t \ge 0$ can be expressed as

$$\mathbf{p}(t) = \exp(\mathbf{A}t)\,\mathbf{p}(0),$$

where $\mathbf{p}(0)$ is the initial probability vector and the matrix $\exp(\mathbf{A}t)$ is defined
by the infinite series

$$\exp(\mathbf{A}t) \triangleq \mathbf{I} + \mathbf{A}t + \frac{1}{2!}(\mathbf{A}t)^2 + \frac{1}{3!}(\mathbf{A}t)^3 + \frac{1}{4!}(\mathbf{A}t)^4 + \cdots.$$

Do not worry about convergence of this series, but it is known that it absolutely
converges for all finite $t$.
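To make the series definition in Problem 9.62(c) concrete, the following Python sketch sums a truncated version of it and applies the result to an initial probability vector. It does not solve the problem: the generator matrix `A` below uses made-up rates (not those of Figure P9.62), and the names `expm_series` and `state_probabilities` are illustrative only. The only structural assumption is that each column of `A` sums to zero, as it must for a generator acting on a column probability vector.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def expm_series(a, t, terms=60):
    """Truncated series I + At + (At)^2/2! + ... using `terms` terms."""
    n = len(a)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]  # (At)^0 / 0!
    for k in range(1, terms):
        # (At)^k / k!  =  [(At)^(k-1) / (k-1)!] * A * (t / k)
        term = [[x * t / k for x in row] for row in mat_mul(term, a)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def state_probabilities(a, p0, t):
    """p(t) = exp(At) p(0) for a column probability vector p(0)."""
    e = expm_series(a, t)
    return [sum(e[i][j] * p0[j] for j in range(len(p0))) for i in range(len(p0))]


# Hypothetical generator matrix (assumed rates); every column sums to zero.
A = [[-1.0, 2.0, 0.0],
     [1.0, -5.0, 4.0],
     [0.0, 3.0, -4.0]]

if __name__ == "__main__":
    p = state_probabilities(A, [0.0, 1.0, 0.0], 0.5)
    print("p(0.5) =", p, " sum =", sum(p))
```

Because the columns of `A` sum to zero, the components of `p(t)` keep summing to one for every `t`, which gives a quick sanity check on any generator matrix found in part (b). The plain truncated series is adequate only for moderate values of `t`; for large `t` a dedicated matrix-exponential routine is the safer choice.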
Advanced Topics in
Random Processes


In this chapter, we reconsider some of the topics in Chapters 8 and 9 from a more advanced 
or sophisticated viewpoint. In particular we introduce the mean-square derivative and inte- 
gral to provide an extension of the sample function stochastic integral and derivative of 
Chapter 9. This will increase the scope of our linear system analysis of random processes 
to a much broader class, called second-order processes, that is most often encountered in 
more advanced studies, as well as in routine practice. 


10.1 MEAN-SQUARE CALCULUS 


From our work with limits in Chapter 8, we expect that the mean-square derivative and 
integral will be weaker concepts than the sample-function derivative and integral that we 
looked at in Chapter 9. The reason this added abstractness is necessary is that many useful 
random processes do not have sample-function derivatives. Furthermore, this defect cannot 
be determined from just examining the mean and correlation or covariance functions. In the 
first section we begin by looking at the various concepts of continuity for random processes. 


Stochastic Continuity and Derivatives [10-1] 


We will consider random processes that may be real- or complex-valued. The concept of 
continuity for random processes relies on the concept of limit for random processes just the 
same as in the case of ordinary functions. However, in the case of random processes, as for 
random sequences, there are four concepts for limit, which implies that there are four types 
of stochastic continuity. The strongest continuity would correspond to sure convergence of
the sample function limits,

sample-function continuity: $\lim_{\epsilon \to 0} X(t+\epsilon, \zeta) = X(t, \zeta)$ for all $\zeta \in \Omega$.


The next strongest situation would be to disregard those sample functions in a set of probability
zero that are discontinuous at time $t$. This would yield continuity almost surely (a.s.),

a.s. continuity: $P\left[\lim_{s \to t} X(s) \ne X(t)\right] = 0$.


Corresponding to the concept of limit in probability, we could study the concept of continuity 
in probability. 


p continuity: $\lim_{s \to t} P[|X(s) - X(t)| > \epsilon] = 0$ for each $\epsilon > 0$.


The most useful and tractable concept of continuity turns out to be a mean-square-based 
definition. This is the concept that we will use almost exclusively. 


Definition 10.1-1 A random process $X(t)$ is continuous in the mean-square sense at
the point $t$ if, as $\epsilon \to 0$, we have $E[|X(t+\epsilon) - X(t)|^2] \to 0$.

If the above holds for all $t$, we say $X(t)$ is mean-square (m.s.) continuous.


One advantage of this definition is that it is readily expressible in terms of correlation 
functions. By expanding out the expectation of the square of the difference, it is seen that 
we just require a certain continuity in the correlation function. 


Theorem 10.1-1 The random process $X(t)$ is m.s. continuous at $t$ if $R_{XX}(t_1, t_2)$ is
continuous at the point $t_1 = t_2 = t$.


Proof Expand the expectation in Definition 10.1-1 to get an expression involving $R_{XX}$,

$$E[|X(t+\epsilon) - X(t)|^2] = R_{XX}(t+\epsilon, t+\epsilon) - R_{XX}(t, t+\epsilon) - R_{XX}(t+\epsilon, t) + R_{XX}(t,t).$$

Clearly the right-hand side goes to zero as $\epsilon \to 0$ if the two-dimensional function $R_{XX}$ is
continuous at $t_1 = t_2 = t$.


Example 10.1-1 
(standard Wiener process) We investigate the m.s. continuity of the Wiener process of 
Chapter 9. By Equation 9.2-10 we have 


$$R_{XX}(t_1, t_2) = \min(t_1, t_2), \quad t_1, t_2 \ge 0.$$


The problem thus reduces to whether the function $\min(t_1, t_2)$ is continuous at the point
$(t,t)$. (See Figure 10.1-1.) The value of the function $\min(t_1, t_2)$ at $(t,t)$ is $t$, so we consider

$$|\min(t_1, t_2) - t|$$

for $t_1 = t + \epsilon_1$ and $t_2 = t + \epsilon_2$,

$$|\min(t + \epsilon_1, t + \epsilon_2) - t|.$$

But

$$|\min(t + \epsilon_1, t + \epsilon_2) - t| \le \max(\epsilon_1, \epsilon_2),$$

so this magnitude can be made arbitrarily small by choice of $\epsilon_1 > 0$ and $\epsilon_2 > 0$. Thus the
Wiener process is m.s. continuous.

Figure 10.1-1 Contour plot of $\min(t_1, t_2)$.
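The continuity test of Theorem 10.1-1 can also be checked numerically. The short Python sketch below evaluates the four-term expansion from the proof for the correlation function $\min(t_1, t_2)$ of the standard Wiener process. The function names are illustrative only, and the expansion is written for a real-valued process.

```python
def ms_increment(corr, t, eps):
    """E[|X(t+eps) - X(t)|^2] written via the correlation function corr(t1, t2)."""
    return (corr(t + eps, t + eps) - corr(t, t + eps)
            - corr(t + eps, t) + corr(t, t))


def wiener_corr(t1, t2):
    """Correlation function of the standard Wiener process."""
    return min(t1, t2)


if __name__ == "__main__":
    t = 1.0
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        value = ms_increment(wiener_corr, t, eps)
        print(f"eps = {eps:g}   E|X(t+eps) - X(t)|^2 = {value:.6g}")
```

For this correlation function the mean-square increment works out to $|\epsilon|$, so it shrinks linearly with $\epsilon$, which is the m.s. continuity found in Example 10.1-1.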


Lest the reader feel overly confident at this point, note that the Poisson counting process 
has the same correlation function when centered at its continuous mean function; thus, 
the Poisson process is also m.s. continuous even though the sample functions of this jump 
process are clearly not continuous! Evidently m.s. continuity does not mean that the sample 
functions are continuous. 
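The Poisson remark can be made explicit with a short calculation. The notation is assumed here: $N(t)$ is a Poisson counting process with rate $\lambda$ and $N_c(t) \triangleq N(t) - \lambda t$ is its centered version, whose covariance function is $\lambda\min(t_1, t_2)$. For $\epsilon > 0$:

```latex
\begin{align*}
E\big[|N_c(t+\epsilon) - N_c(t)|^2\big]
  &= \lambda(t+\epsilon) - \lambda t - \lambda t + \lambda t
   = \lambda\epsilon \;\longrightarrow\; 0 \quad \text{as } \epsilon \to 0, \\
P\big[N(t+\epsilon) \neq N(t)\big]
  &= 1 - e^{-\lambda\epsilon} \;\longrightarrow\; 0 \quad \text{as } \epsilon \to 0 .
\end{align*}
```

The second line shows where the apparent paradox goes away: any fixed time $t$ is a jump instant with probability zero, even though every sample function jumps at some (random) times.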

We next look at a special case of Theorem 10.1-1 for WSS random processes. 


Corollary 10.1-1 A wide-sense stationary random process $X(t)$ is m.s. continuous
for all $t$ if $R_{XX}(\tau)$ is continuous at $\tau = 0$.


Proof By Theorem 10.1-1, we need continuity on $t_1 = t_2$, but this is the same as
$\tau = 0$. Hence, $R_{XX}(\tau)$ must be continuous at $\tau = 0$.
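For reference, the expansion used in the proof of Theorem 10.1-1 collapses to a one-dimensional expression in the WSS case. The sketch below assumes the convention $R_{XX}(\tau) = E[X(t+\tau)X^*(t)]$, under which $R_{XX}(-\tau) = R_{XX}^*(\tau)$:

```latex
\begin{align*}
E\big[|X(t+\epsilon) - X(t)|^2\big]
  &= R_{XX}(0) - R_{XX}(-\epsilon) - R_{XX}(\epsilon) + R_{XX}(0) \\
  &= 2\big[R_{XX}(0) - \operatorname{Re} R_{XX}(\epsilon)\big],
\end{align*}
```

which does not depend on $t$ and goes to zero as $\epsilon \to 0$ whenever $R_{XX}(\tau)$ is continuous at $\tau = 0$.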


We note that in the stationary case we get m.s. continuity of the random process for all
time after verifying only the continuity of a one-dimensional function $R_{XX}(\tau)$ at the origin,
$\tau = 0$. Continuity is a necessary condition for the existence of the derivative of an ordinary
function. However, considering the case where the difference

$$x(t+\epsilon) - x(t) = O(\sqrt{\epsilon}), \qquad (10.1\text{-}1)$$

we see that it is not a sufficient condition for a derivative to exist, even in the ordinary
calculus. Similarly, in the mean-square calculus, we find that m.s. continuity is not a sufficient
condition for the existence of the m.s. derivative, which is defined as follows.


Definition 10.1-2 The random process $X(t)$ has a mean-square derivative at $t$ if the
mean-square limit of $[X(t+\varepsilon) - X(t)]/\varepsilon$ exists as $\varepsilon \to 0$. ∎


If it exists, we denote this m.s. derivative by $X'$, $X^{(1)}$, $dX/dt$, or $\dot{X}$. Generally, we do
not know $X'$ when we are trying to determine whether it exists, so we turn to the Cauchy
convergence criterion (ref. Section 8.7). In this case the test becomes


$$E\left[\left|\frac{X(t+\varepsilon_1) - X(t)}{\varepsilon_1} - \frac{X(t+\varepsilon_2) - X(t)}{\varepsilon_2}\right|^2\right] \to 0 \quad \text{as } \varepsilon_1 \text{ and } \varepsilon_2 \to 0. \tag{10.1-2}$$


As was the case for continuity, we can express this condition in terms of the correlation 
function, making it easier to apply. This generally useful condition is stated in the following 
theorem. 


Theorem 10.1-2 A random process $X(t)$ with autocorrelation function $R_{XX}(t_1, t_2)$
has an m.s. derivative at time $t$ if $\partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2$ exists at $t_1 = t_2 = t$.


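Proof Expand the square inside the expectation in Equation 10.1-2 to get three terms,
the first and last of which look like


$$E\left[\left|\frac{X(t+\varepsilon) - X(t)}{\varepsilon}\right|^2\right] = \frac{R_{XX}(t+\varepsilon, t+\varepsilon) - R_{XX}(t, t+\varepsilon) - R_{XX}(t+\varepsilon, t) + R_{XX}(t, t)}{\varepsilon^2},$$


which converges to
$$\partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2,$$


if the second mixed partial derivative exists at the point $(t_1, t_2) = (t, t)$. The middle or
cross-term is


$$\begin{aligned}
&-2E\left(\frac{X(t+\varepsilon_1) - X(t)}{\varepsilon_1} \cdot \frac{[X(t+\varepsilon_2) - X(t)]^*}{\varepsilon_2}\right) \\
&\quad = -2\,\frac{R_{XX}(t+\varepsilon_1, t+\varepsilon_2) - R_{XX}(t, t+\varepsilon_2) - R_{XX}(t+\varepsilon_1, t) + R_{XX}(t, t)}{\varepsilon_1 \varepsilon_2} \\
&\quad = -2\left(\frac{R_{XX}(t+\varepsilon_1, t+\varepsilon_2) - R_{XX}(t+\varepsilon_1, t)}{\varepsilon_2} - \frac{R_{XX}(t, t+\varepsilon_2) - R_{XX}(t, t)}{\varepsilon_2}\right)\Big/\varepsilon_1 \\
&\quad \to -2\,\frac{\partial}{\partial t_1}\left(\frac{\partial R_{XX}(t_1, t_2)}{\partial t_2}\right)\bigg|_{(t_1,t_2)=(t,t)}
= -2\,\frac{\partial^2 R_{XX}(t_1, t_2)}{\partial t_1 \partial t_2}\bigg|_{(t_1,t_2)=(t,t)}
\end{aligned}$$

if this second mixed partial derivative exists at the point $(t_1, t_2) = (t, t)$. Combining all
three of these terms, we get convergence to


$$2\,\partial^2 R_{XX}/\partial t_1 \partial t_2 - 2\,\partial^2 R_{XX}/\partial t_1 \partial t_2 = 0.$$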


In the preceding theorem, the reader should note that we are talking about a two-dimensional
function $R_{XX}(t_1, t_2)$ and its second mixed partial derivative evaluated on the
diagonal points $(t_1, t_2) = (t, t)$. This is clearly not the same as the second derivative of the
one-dimensional function $R_{XX}(t, t)$, which is the restriction of $R_{XX}(t_1, t_2)$ to the diagonal
line $t_1 = t_2$. In some cases the derivative of $R_{XX}(t, t)$ will exist while the partial derivative
of $R_{XX}(t_1, t_2)$ will not.


Example 10.1-2 
(derivative of Wiener process) Let $W(t)$ be a Wiener process with correlation function
$R_{WW}(t_1, t_2) = \sigma^2 \min(t_1, t_2)$. We enquire about the behavior of $E[|(W(t+\varepsilon) - W(t))/\varepsilon|^2]$
when $\varepsilon$ is near zero. Assuming that $\varepsilon$ is positive, we have by calculation that
$E[|W(t+\varepsilon) - W(t)|^2] = \sigma^2 \varepsilon$, which shows that the Wiener process is mean-square continuous,
as we already found in Example 10.1-1. But now when we divide by $\varepsilon^2$ inside the
expectation, as required by $E[|(W(t+\varepsilon) - W(t))/\varepsilon|^2]$, we end up with $\sigma^2/\varepsilon$, which goes to
infinity as $\varepsilon$ approaches zero. So the mean-square derivative of the Wiener process does not
exist, at least in an ordinary sense. Looking at the above equation, we see the problem is
that in some sense at least, the sample functions $w(t)$ of the Wiener process have increments
typically on the order of $w(t+\varepsilon) - w(t) \approx \sigma\sqrt{\varepsilon}$.
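The calculation in this example can be checked numerically. The sketch below is only an illustration: the helper names `wiener_corr` and `ms_increment` and the choice $\sigma^2 = 1$, $t = 1$ are assumptions, not part of the text. It evaluates the four-term correlation expression for the mean-square increment and then divides by $\varepsilon^2$.

```python
def wiener_corr(t1, t2, sigma2=1.0):
    """Correlation function of the Wiener process: sigma^2 * min(t1, t2)."""
    return sigma2 * min(t1, t2)


def ms_increment(corr, t, eps):
    """E[|X(t+eps) - X(t)|^2] written in terms of the correlation function."""
    return (corr(t + eps, t + eps) - corr(t, t + eps)
            - corr(t + eps, t) + corr(t, t))


t = 1.0
for eps in (1e-1, 1e-2, 1e-3):
    inc = ms_increment(wiener_corr, t, eps)   # tends to 0: m.s. continuous
    quot = inc / eps**2                       # grows like sigma^2 / eps
    print(f"eps={eps:g}  increment={inc:.6f}  quotient power={quot:.3f}")
```

The increment shrinks in proportion to $\varepsilon$, while the power of the difference quotient grows in proportion to $1/\varepsilon$, which is the behavior that rules out an ordinary m.s. derivative.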


Example 10.1-3 
(exponential correlation function) Let $X(t)$ be a random process with correlation function
$R_{XX}(t_1, t_2) = \sigma^2 \exp(-\alpha|t_1 - t_2|)$. To test for the existence of an m.s. derivative $X'$, we
attempt to compute the second mixed partial derivative of $R_{XX}$. We first compute


$$\frac{\partial R_{XX}}{\partial t_2} =
\begin{cases}
\dfrac{\partial}{\partial t_2}\left[\sigma^2 \exp(-\alpha(t_2 - t_1))\right], & t_1 < t_2 \\[2ex]
\dfrac{\partial}{\partial t_2}\left[\sigma^2 \exp(-\alpha(t_1 - t_2))\right], & t_1 > t_2,
\end{cases} \tag{10.1-3}$$

$$\phantom{\frac{\partial R_{XX}}{\partial t_2}} =
\begin{cases}
-\alpha\sigma^2 \exp(-\alpha(t_2 - t_1)), & t_1 < t_2 \\
+\alpha\sigma^2 \exp(-\alpha(t_1 - t_2)), & t_1 > t_2.
\end{cases} \tag{10.1-4}$$


Graphing the function $R_{XX}(t_1, t_2)$ as shown in Figure 10.1-2, we see that there is a cusp on
the diagonal line $t_1 = t_2$. Thus, there is no partial derivative there for any $t$. So the second
mixed partial cannot exist there either, and we conclude that no m.s. derivative exists for
an $X(t)$ with this correlation function. Evidently, such random processes are not smooth
enough.
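The cusp can be made visible numerically. The following sketch, with assumed parameter values $\sigma^2 = 3$, $\alpha = 2$ and hypothetical helper names, estimates $\partial R_{XX}/\partial t_2$ by a central difference just on either side of the diagonal. The two one-sided values approach $-\alpha\sigma^2$ and $+\alpha\sigma^2$, so the first partial has a step on $t_1 = t_2$.

```python
import math


def exp_corr(t1, t2, sigma2=3.0, alpha=2.0):
    """R_XX(t1, t2) = sigma^2 * exp(-alpha * |t1 - t2|)."""
    return sigma2 * math.exp(-alpha * abs(t1 - t2))


def d_dt2(corr, t1, t2, h=1e-5):
    """Central-difference estimate of the partial derivative with respect to t2."""
    return (corr(t1, t2 + h) - corr(t1, t2 - h)) / (2.0 * h)


t, gap = 1.0, 1e-2
below = d_dt2(exp_corr, t - gap, t)   # t1 < t2 branch: about -alpha * sigma^2
above = d_dt2(exp_corr, t + gap, t)   # t1 > t2 branch: about +alpha * sigma^2
print(below, above, above - below)    # the jump approaches 2 * alpha * sigma^2
```

The step size `h` is kept smaller than `gap` so that the difference stencil never straddles the diagonal.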


†While it is conventional to use $\sigma^2$ as the parameter of the Wiener process, please note that $\sigma^2$ is not
the variance! Earlier we used $\alpha$ for this parameter.

Figure 10.1-2 Contour plot of $R_{XX}(t_1, t_2)$ of Example 10.1-3.


Example 10.1-4 
(Gaussian shaped correlation) We look at another random process $X(t)$ with mean function
$\mu_X = 5$ and correlation function


$$R_{XX}(t_1, t_2) = \sigma^2 \exp(-\alpha(t_1 - t_2)^2) + 25,$$


which is smooth on the diagonal line. The first partial with respect to $t_2$ is


$$\frac{\partial R_{XX}}{\partial t_2} = 2\alpha(t_1 - t_2)\sigma^2 \exp(-\alpha(t_1 - t_2)^2).$$
Then, the second mixed partial becomes
$$\frac{\partial^2 R_{XX}}{\partial t_1 \partial t_2} = 2\alpha\sigma^2\left[1 - 2\alpha(t_1 - t_2)^2\right] \exp(-\alpha(t_1 - t_2)^2),$$
which evaluated at $t_1 = t_2 = t$ becomes
$$\frac{\partial^2 R_{XX}}{\partial t_1 \partial t_2}\bigg|_{(t_1,t_2)=(t,t)} = 2\alpha\sigma^2,$$


so that in this case the m.s. derivative $X'(t)$ exists for all $t$.
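As a sanity check on this result, the mixed partial on the diagonal can be approximated by a second difference of the correlation function. This is a sketch under assumed values $\sigma^2 = 3$ and $\alpha = 2$; the function names are illustrative only.

```python
import math


def gauss_corr(t1, t2, sigma2=3.0, alpha=2.0):
    """R_XX(t1, t2) = sigma^2 * exp(-alpha * (t1 - t2)^2) + 25."""
    return sigma2 * math.exp(-alpha * (t1 - t2) ** 2) + 25.0


def mixed_partial(corr, t1, t2, h=1e-3):
    """Forward-difference estimate of d^2 R / (dt1 dt2) at (t1, t2)."""
    return (corr(t1 + h, t2 + h) - corr(t1 + h, t2)
            - corr(t1, t2 + h) + corr(t1, t2)) / h**2


estimate = mixed_partial(gauss_corr, 0.7, 0.7)
print(estimate)  # close to 2 * alpha * sigma^2 = 12 for these assumed parameters
```

The difference stencil is the same four-term expression that appears in the proof of Theorem 10.1-2, so a finite limit here mirrors the existence condition of the theorem.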


Given the existence of the m.s. derivative, the next question we might be interested
in is: What is its probability law? Or more simply: (a) What are its mean and correlation
function? and (b) How is $X'$ correlated with $X$? To answer (a) and (b), we start by
considering the expectation

$$E[X'(t)] = E[dX(t)/dt].$$
Now assuming that the m.s. derivative is a linear operator, we would expect to be able to
interchange the derivative and the expectation under certain conditions to obtain


$$E[X'(t)] = dE[X(t)]/dt = d\mu_X(t)/dt. \tag{10.1-5}$$
We can show that this equation is indeed true for the m.s. derivative by making use of the
inequality

$$|E[Z]|^2 \le E[|Z|^2],$$
for a complex random variable $Z$, a consequence of the nonnegativity of $\mathrm{Var}[Z]$. First we
set $A_n \triangleq n[X(t + 1/n) - X(t)]$ and note that the right-hand side of Equation 10.1-5 is just
$\lim_{n\to\infty} E[A_n]$. Thus, Equation 10.1-5 will be true if $\lim_{n\to\infty} E[X'(t) - A_n] = 0$. Then,
making use of the above inequality with $Z \triangleq X'(t) - A_n$, we get


$$|E[X'(t) - A_n]|^2 \le E[|X'(t) - A_n|^2],$$


where the right-hand side goes to zero by the definition of m.s. derivative. Thus, the left-hand
side must also go to zero, and hence we are free to interchange the order of m.s.
differentiation and mathematical expectation, that is, Equation 10.1-5 has been shown to
be correct.
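To calculate the correlation function of the mean-square derivative process, we first
define
$$R_{X'X'}(t_1, t_2) \triangleq E[X'(t_1)X'^*(t_2)], \tag{10.1-6}$$


and formally compute, interchanging the order of expectation and differentiation,


$$\begin{aligned}
R_{X'X'}(t_1, t_2) &= \lim_{m,n\to\infty} E\left[n\left(X\left(t_1 + \tfrac{1}{n}\right) - X(t_1)\right) \cdot m\left(X\left(t_2 + \tfrac{1}{m}\right) - X(t_2)\right)^*\right] \\
&= \lim_{n\to\infty} n\left[\frac{\partial R_{XX}\left(t_1 + \tfrac{1}{n}, t_2\right)}{\partial t_2} - \frac{\partial R_{XX}(t_1, t_2)}{\partial t_2}\right] \\
&= \partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2.
\end{aligned}$$


This is really just the first step in the proof of Theorem 10.1-2 generalized to allow $t_1 \ne t_2$.
To justify this interchange of the m.s. derivative and expectation operators, we make use of
the Schwarz inequality (Section 4.4), which was derived there for real-valued
random variables. (It also holds for the complex case as will be shown in the
next section.)
We first define
$$R_{XX'}(t_1, t_2) \triangleq E[X(t_1)X'^*(t_2)]. \tag{10.1-7}$$
We use the Schwarz inequality $|E[AB_n^*]| \le \sqrt{E[|A|^2]E[|B_n|^2]}$ with $A \triangleq X(t_1)$ and
$$B_n \triangleq X'(t_2) - [X(t_2 + \varepsilon_n) - X(t_2)]/\varepsilon_n$$
to obtain
$$\begin{aligned}
R_{XX'}(t_1, t_2) &= \lim_{n\to\infty} E\left[X(t_1)\left(X(t_2 + \varepsilon_n) - X(t_2)\right)^*/\varepsilon_n\right] \\
&= \lim_{n\to\infty} [R_{XX}(t_1, t_2 + \varepsilon_n) - R_{XX}(t_1, t_2)]/\varepsilon_n \\
&= \frac{\partial R_{XX}(t_1, t_2)}{\partial t_2}.
\end{aligned}$$
Then $E[X'(t_1)X'^*(t_2)]$ is obtained similarly as


$$\begin{aligned}
\lim_{n\to\infty} E\left[\frac{X(t_1 + \varepsilon_n) - X(t_1)}{\varepsilon_n}\, X'^*(t_2)\right] &= \lim_{n\to\infty} [R_{XX'}(t_1 + \varepsilon_n, t_2) - R_{XX'}(t_1, t_2)]/\varepsilon_n \\
&= \partial R_{XX'}(t_1, t_2)/\partial t_1.
\end{aligned}$$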
Thus, we have finally obtained the following theorem. 


Theorem 10.1-3 If a random process $X(t)$ with mean function $\mu_X(t)$ and correlation
function $R_{XX}(t_1, t_2)$ has an m.s. derivative $X'(t)$, then the mean and correlation functions
of $X'(t)$ are given by

$$\mu_{X'}(t) = d\mu_X(t)/dt$$
and
$$R_{X'X'}(t_1, t_2) = \partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2.$$


Example 10.1-5 
(m.s. derivative process) We now continue to study the m.s. derivative of the process $X(t)$
of Example 10.1-4 by calculating its mean function and its correlation function. We obtain


$$\mu_{X'}(t) = d\mu_X(t)/dt = 0$$
and
$$\begin{aligned}
R_{X'X'}(t_1, t_2) &= \partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2 \\
&= 2\alpha\sigma^2\left[1 - 2\alpha(t_1 - t_2)^2\right] \exp(-\alpha(t_1 - t_2)^2).
\end{aligned}$$


We note that in the course of calculating $R_{X'X'}$, we are effectively verifying existence of
the m.s. derivative by noting whether this deterministic partial derivative exists at $t_1 = t_2$.


Generalized m.s. derivatives. In Example 10.1-3 we found that the correlation function
had a cusp at $t_1 = t_2$ which precluded the existence of an m.s. derivative. However,
this second mixed partial derivative does exist in the sense of singularity functions. In
earlier experience with linear systems, we have worked with singularity functions and have
found them operationally very elegant and simple to use. They are properly treated mathematically
through the rather abstract theory of generalized functions. If we proceed to
take the partial derivative, in this generalized sense, with respect to $t_1$ of $\partial R_{XX}/\partial t_2$ in
Equation 10.1-4, the step discontinuity on the diagonal gives rise to an impulse in $t_1$. We
obtain


$$\begin{aligned}
\frac{\partial^2 R_{XX}(t_1, t_2)}{\partial t_1 \partial t_2} &=
\begin{cases}
-\alpha^2\sigma^2 \exp(-\alpha(t_2 - t_1)), & t_1 < t_2 \\
-\alpha^2\sigma^2 \exp(-\alpha(t_1 - t_2)), & t_1 > t_2
\end{cases}
\; + \; 2\alpha\sigma^2\delta(t_1 - t_2) \\
&= 2\alpha\sigma^2\delta(t_1 - t_2) - \alpha^2\sigma^2 \exp(-\alpha|t_1 - t_2|).
\end{aligned} \tag{10.1-8}$$


We can call the random process with this autocorrelation function a generalized random
process and say it is the generalized m.s. derivative of the conventional process $X(t)$. When
we say this, we mean that the defining mean-square limit (Equation 10.1-2) is zero in the
sense of generalized functions. A more detailed justification of generalized random processes
may be found in [10-2]. In this text we will be content to use it formally as a notation for
a limiting behavior of conventional random processes.

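The surprising thing about the autocorrelation function in Equation 10.1-8 is the term
$2\alpha\sigma^2\delta(t_1 - t_2)$. In fact, if we single out this term and consider the autocorrelation function


$$R_{XX}(t_1, t_2) = \sigma^2\delta(t_1 - t_2), \tag{10.1-9}$$


we can show that it corresponds to the generalized m.s. derivative of the Wiener process
defined in Chapter 9. By definition, $\mu_X(t) = 0$ and $R_{XX}(t_1, t_2) = \alpha\min(t_1, t_2)$ for the
Wiener process. We proceed by calculating


$$\frac{\partial R_{XX}}{\partial t_2} =
\begin{cases}
\dfrac{\partial(\alpha t_2)}{\partial t_2}, & t_2 < t_1 \\[2ex]
\dfrac{\partial(\alpha t_1)}{\partial t_2}, & t_2 \ge t_1
\end{cases}
=
\begin{cases}
\alpha, & t_2 < t_1 \\
0, & t_2 > t_1.
\end{cases}$$


Then


$$\frac{\partial^2 R_{XX}}{\partial t_1 \partial t_2} = \frac{\partial}{\partial t_1}
\begin{cases}
\alpha, & t_1 > t_2 \\
0, & t_1 < t_2
\end{cases}
= \frac{\partial}{\partial t_1}\left[\alpha\, u(t_1 - t_2)\right] = \alpha\,\delta(t_1 - t_2),$$


which is the same as Equation 10.1-9 if we set $\sigma^2 = \alpha$.†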

The generalized m.s. derivative of the Wiener process is called white Gaussian noise. It
is not a random process in the conventional sense since, for example, its mean-square value
$R_{X'X'}(t, t)$ at time $t$ is infinite. Nevertheless, it is the formal limit of approximating conventional
processes whose correlation function is a narrow pulse of area $\alpha$. These approximating
processes will often yield almost the same system outputs (up to mean-square equivalence);
thus, the white noise can be used to simplify the analysis in these cases in essentially the
same way that impulse functions are used in deterministic system analysis. In terms of
power spectral density (psd), the white noise is the idealization of having a flat psd over all
frequencies, $S_{X'X'}(\omega) = \alpha$, $-\infty < \omega < +\infty$.

Many random processes are stationary or approximately so. In the stationary case,
we have seen that we can write the correlation function as a 1-D function $R_{XX}(\tau)$. Then
we can express the conditions and results of Theorems 10.1-2 and 10.1-3 in terms of this
function. Unfortunately, the resulting formulas are not as intuitive and tend to be somewhat
confusing. For the special case of a stationary or a wide-sense stationary (WSS) random
process, we get the following:


Theorem 10.1-4 The m.s. derivative of a WSS random process $X(t)$ exists at time
$t$ if the autocorrelation $R_{XX}(\tau)$ has derivatives up to order two at $\tau = 0$.


Proof By the previous result we need $\partial^2 R_{XX}(t_1, t_2)/\partial t_1 \partial t_2\big|_{(t_1,t_2)=(t,t)}$. Now


$$R_{XX}(\tau) = R_{XX}(t + \tau, t), \quad \text{functionally independent of } t,$$


so
$$\frac{\partial R_{XX}(t_1, t_2)}{\partial t_1}\bigg|_{(t_1,t_2)=(t+\tau,t)} = dR_{XX}(\tau)/d\tau$$


since $t_2 = t$ is held constant, and


$$\frac{\partial R_{XX}(t_1, t_2)}{\partial t_2}\bigg|_{(t_1,t_2)=(t,t-\tau)} = \partial R_{XX}(t, t - \tau)/\partial(-\tau) = -dR_{XX}(\tau)/d\tau,$$


since $t_1 = t$ is held constant here; thus


$$\frac{\partial^2 R_{XX}(t_1, t_2)}{\partial t_1 \partial t_2}\bigg|_{(t_1,t_2)=(t+\tau,t)} = -d^2 R_{XX}(\tau)/d\tau^2. \qquad ∎$$


†Please note again that $\sigma^2$ is merely a positive parameter here. It is not the variance of the white
noise. We use this notation only because it has become standard to do so.


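Calculating the second-order properties of $X'$, we have


$$E[X'(t)] = \mu_{X'}(t) = 0,$$


$$E[X'(t + \tau)X^*(t)] = R_{X'X}(\tau) = +dR_{XX}(\tau)/d\tau, \tag{10.1-10a}$$
$$E[X(t + \tau)X'^*(t)] = R_{XX'}(\tau) = -dR_{XX}(\tau)/d\tau, \tag{10.1-10b}$$

and
$$E[X'(t + \tau)X'^*(t)] = R_{X'X'}(\tau) = -d^2 R_{XX}(\tau)/d\tau^2, \tag{10.1-11}$$


which follow from the formulas used in the above proof.
One can also derive these equations directly, for example,


$$\begin{aligned}
E[X(t + \tau)X'^*(t)] &= \lim_{\varepsilon\to 0} \frac{E[X(t + \tau)X^*(t + \varepsilon)] - E[X(t + \tau)X^*(t)]}{\varepsilon} \\
&= \lim_{\varepsilon\to 0} \frac{R_{XX}(\tau - \varepsilon) - R_{XX}(\tau)}{\varepsilon} \\
&= -dR_{XX}(\tau)/d\tau = R_{XX'}(\tau),
\end{aligned}$$
and similarly for $R_{X'X}(\tau)$.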


Example 10.1-6 
(m.s. derivative of WSS process) Let $X(t)$ have zero mean and correlation function


$$R_{XX}(\tau) = \sigma^2 \exp(-\alpha^2\tau^2).$$


Here the m.s. derivative exists because $R_{XX}(\tau)$ is infinitely differentiable at $\tau = 0$. Computing
the first and second derivatives, we get


$$dR_{XX}/d\tau = -2\alpha^2\tau\sigma^2 \exp(-\alpha^2\tau^2).$$


Then
$$R_{X'X'}(\tau) = -d^2 R_{XX}/d\tau^2 = 2\alpha^2\sigma^2(1 - 2\alpha^2\tau^2) \exp(-\alpha^2\tau^2).$$
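The closed form for $R_{X'X'}(\tau)$ can be compared against a finite-difference approximation of $-d^2R_{XX}/d\tau^2$ (Equation 10.1-11). The parameter values and function names below are assumptions chosen only for illustration.

```python
import math

SIGMA2, ALPHA = 1.5, 0.8   # assumed illustrative parameters


def r_xx(tau):
    """R_XX(tau) = sigma^2 * exp(-alpha^2 * tau^2)."""
    return SIGMA2 * math.exp(-(ALPHA * tau) ** 2)


def r_dxdx_formula(tau):
    """Closed form from the example: 2 a^2 s^2 (1 - 2 a^2 tau^2) exp(-a^2 tau^2)."""
    a2 = ALPHA ** 2
    return 2.0 * a2 * SIGMA2 * (1.0 - 2.0 * a2 * tau ** 2) * math.exp(-a2 * tau ** 2)


def r_dxdx_numeric(tau, h=1e-4):
    """Minus the central second difference of R_XX, approximating -d^2 R_XX / d tau^2."""
    return -(r_xx(tau + h) - 2.0 * r_xx(tau) + r_xx(tau - h)) / h ** 2


for tau in (0.0, 0.5, 1.0):
    print(tau, r_dxdx_formula(tau), r_dxdx_numeric(tau))
```

At $\tau = 0$ both columns give the average power of the derivative process, $2\alpha^2\sigma^2$.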
Further Results on m.s. Convergence [10-1]


We now consolidate and present further results on mean-square convergence that are helpful
in this chapter. We will have use for two inequalities for the moments of complex random
variables: the Schwarz inequality and another inequality called the triangle inequality.


Complex Schwarz inequality. Let $X$ and $Y$ be second-order complex random variables,
that is,
$$E[|X|^2] < \infty \quad \text{and} \quad E[|Y|^2] < \infty;$$


then we have the inequality


$$|E[XY^*]| \le \sqrt{E[|X|^2]E[|Y|^2]}.$$


Proof Consider $W \triangleq aX + Y$. Then minimize $E[|W|^2]$ as a function of $a$, where $a$ is a
complex variable. Then, since the minimum must be nonnegative, the preceding inequality
is obtained as follows. First we write the function


$$\begin{aligned}
f(a) &= E[|W|^2] \\
&= E[(aX + Y)(a^*X^* + Y^*)].
\end{aligned}$$


Now, we want to minimize this function with respect to (wrt) the complex variable $a$. If $a$
were a real variable, we could just take the derivative and set it equal to zero and then solve
for the minimizing $a$. Here, things are not so simple. The most straightforward approach
would be to express $a = a_r + ja_i$ and then consider partial derivatives wrt the real part $a_r$
and imaginary part $a_i$ separately. A more elegant approach is to take partials wrt $a$ and $a^*$,
as though they were independent variables, which they are not! Using this latter method
we get


$$\begin{aligned}
0 &= \frac{\partial f(a)}{\partial a} = E[X(a^*X^* + Y^*)], \\
0 &= \frac{\partial f(a)}{\partial a^*} = E[(aX + Y)X^*],
\end{aligned}$$


either of which yields the desired result that the minimizing value of $a$ is


$$a_{\min} = -\frac{E[X^*Y]}{E[|X|^2]}.$$
Finally, evaluating $f(a_{\min})$, and noting that $f(a)$ is always nonnegative, we get


$$E[|Y|^2] - \frac{|E[XY^*]|^2}{E[|X|^2]} \ge 0,$$


which is equivalent to the above stated Cauchy-Schwarz inequality. ∎
An interesting aside here is that we can regard $-a_{\min}X$ as a best linear estimate† of $Y$,
which we can denote as $\hat{Y}$. We have the result
$$\hat{Y} = \frac{E[X^*Y]}{E[|X|^2]}\, X.$$
Another inequality that will be useful is the triangle inequality, which shows that
the root mean-square (rms) value $\sqrt{E[|X|^2]}$ can be considered as a “length” for random
variables!


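Triangle inequality. Let $X$ and $Y$ be second-order complex random variables; then,


$$\sqrt{E[|X + Y|^2]} \le \sqrt{E[|X|^2]} + \sqrt{E[|Y|^2]}.$$


Proof


$$\begin{aligned}
E[|X + Y|^2] &= E[|X|^2] + E[XY^*] + E[X^*Y] + E[|Y|^2] \\
&\le E[|X|^2] + 2|E[XY^*]| + E[|Y|^2] \\
&\le E[|X|^2] + 2\sqrt{E[|X|^2]E[|Y|^2]} + E[|Y|^2] \quad \text{(by the Schwarz inequality)} \\
&= \left(\sqrt{E[|X|^2]} + \sqrt{E[|Y|^2]}\right)^2. \qquad ∎
\end{aligned}$$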


Note that the quantity $\sqrt{E[|X|^2]}$ thus obeys the equation for a distance or norm,
$$\|X + Y\| \le \|X\| + \|Y\| \tag{10.1-12}$$


and hence the name “triangle inequality,” as pictured in Figure 10.1-3. We see that the
“length” of the vector sum $X + Y$ can never exceed the sum of the “length” of $X$ plus the
“length” of $Y$.

The norm $\|X\|$ also satisfies the equation


$$\|aX\| = |a| \cdot \|X\|$$


for any complex scalar $a$. In fact, one can then define the linear space of second-order
complex random variables with norm


$$\|X\| \triangleq \sqrt{E[|X|^2]}.$$
This space is then a Hilbert space with inner product
$$(X, Y) \triangleq E[XY^*].$$


†Warning: A better “linear” estimate would be $aX + b$, and this would give a lower error when the mean
of $X$ or $Y$ is not zero.


Figure 10.1-3 Conceptualization of triangle inequality in linear space of random variables.


Writing the Cauchy-Schwarz inequality in this geometric notation, we have
$$|(X, Y)| \le \|X\|\,\|Y\|$$


and the best linear estimate becomes


$$\hat{Y} = \frac{(Y, X)}{\|X\|^2}\, X = \frac{(Y, X)}{\|X\|\,\|Y\|} \cdot \frac{X}{\|X\|} \cdot \|Y\|,$$


a product of three terms with the following interpretations: the first term is the orthogonal projection
onto $Y$; the second term is the normalized vector $X$; and the third term is a scaling up to
the norm of $Y$. This orthogonal projection of $Y$ onto $X$ is illustrated in Figure 10.1-4.

If $(Y, X) = 0$, then we write $X \perp Y$, and say “$X$ and $Y$ are orthogonal random
variables.” Note that in this case $\hat{Y}$ would be zero. If $(Y, X)$ took its largest
value $\|X\|\,\|Y\|$ allowed by the Cauchy-Schwarz inequality, then we would have the estimate
$\hat{Y} = \frac{X}{\|X\|}\|Y\|$, which is just a scaling of the random variable $X$ by the ratio of the norms
(standard deviations). Continuing this geometric viewpoint, we can get the Pythagorean
theorem if $X \perp Y$, because then the cross terms $E[XY^*]$ and $E[X^*Y]$ in the above proof of
the triangle inequality (Equation 10.1-12) vanish, and $\|X + Y\|^2 = \|X\|^2 + \|Y\|^2$.
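The geometric statements above can be tried out on a small example. In the sketch below, the probability model is an assumption made purely for illustration: four equally likely outcomes, with the complex values of $X$ and $Y$ on each outcome listed explicitly, so that every expectation is a plain average. The helper names `inner` and `norm` are likewise hypothetical.

```python
import math

# Assumed toy model: four equally likely outcomes, so E[.] is a plain average.
X = [1 + 2j, -1 + 0.5j, 0.5 - 1j, 2 + 0j]
Y = [0.5 + 1j, 1 - 1j, -2 + 0.5j, 1 + 1j]


def inner(u, v):
    """Inner product (U, V) = E[U V*]."""
    return sum(a * b.conjugate() for a, b in zip(u, v)) / len(u)


def norm(u):
    """Norm ||U|| = sqrt(E[|U|^2])."""
    return math.sqrt(inner(u, u).real)


coef = inner(Y, X) / norm(X) ** 2            # (Y, X) / ||X||^2
Y_hat = [coef * x for x in X]                # best estimate of the form a * X
resid = [y - yh for y, yh in zip(Y, Y_hat)]  # estimation error Y - Y_hat
X_plus_Y = [x + y for x, y in zip(X, Y)]

print(abs(inner(X, Y)) <= norm(X) * norm(Y))   # Cauchy-Schwarz
print(norm(X_plus_Y) <= norm(X) + norm(Y))     # triangle inequality
print(abs(inner(resid, X)))                    # ~0: error is orthogonal to X
```

Because the error $Y - \hat{Y}$ is orthogonal to $X$ (and hence to $\hat{Y}$), the Pythagorean relation $\|Y\|^2 = \|\hat{Y}\|^2 + \|Y - \hat{Y}\|^2$ also holds for these lists up to rounding.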

Later in this chapter we will be working with m.s. derivatives and integrals and also m.s.
differential equations. As such, we will have need of the following general results concerning
the moments of mean-square limits.

Figure 10.1-4 Illustration of orthogonal projection of random variable $Y$ onto random variable $X$.


Theorem 10.1-5 Let $X[n] \to X$ and $Y[n] \to Y$ in the m.s. sense with $E[|X|^2] < \infty$
and $E[|Y|^2] < \infty$. Then we have the following five properties:
(a) $\lim_{n\to\infty} E[X[n]] = E[X]$,
(b) $\lim_{n\to\infty} E[X[n]Y^*] = E[XY^*]$,
(c) $\lim_{n\to\infty} E[|X[n]|^2] = E[|X|^2]$,
(d) $\lim_{n\to\infty} E[X[n]Y^*[n]] = E[XY^*]$,
(e) if $X = X_1$ (m.s.), then $P[X \ne X_1] = 0$, that is, $X = X_1$ (a.s.).

Proof

(a) Let $A_n \triangleq X - X[n]$; then we need $\lim_{n\to\infty} E[A_n] = 0$. But $|E[A_n]|^2 \le \|A_n\|^2$,
where $\|A_n\| \triangleq \sqrt{E[|A_n|^2]}$ and $\|A_n\| \to 0$ by the definition of m.s. convergence;
thus $\lim_{n\to\infty} E[A_n] = 0$.
(b) We note $XY^* - X[n]Y^* = A_nY^*$. Then by the Schwarz inequality,


$$|E[A_nY^*]|^2 \le \|A_n\|^2\,\|Y\|^2 \to 0 \quad \text{as } n \to \infty,$$


since $Y$ is second order, that is, $\|Y\|^2$ is finite.
(c) This is the same as $\|X[n]\| \to \|X\|$ in terms of the norm notation. Now by the
triangle inequality Equation 10.1-12,


$$\|X[n]\| = \|X[n] - X + X\| \le \|X[n] - X\| + \|X\|.$$
This tells us that $\lim_{n\to\infty} \|X[n]\| \le \|X\|$. But also $\|X\| = \|X - X[n] + X[n]\| \le
\|X - X[n]\| + \|X[n]\|$, which implies $\lim_{n\to\infty} \|X[n]\| \ge \|X\|$. Therefore,
$\lim_{n\to\infty} \|X[n]\| = \|X\|$.

(d) Consider $XY^* - X[n]Y^*[n]$, and add and subtract $X[n]Y^*$ to obtain


$$(X - X[n])\,Y^* + X[n]\,(Y - Y[n])^*.$$
Next use the triangle inequality followed by the Schwarz inequality to obtain
$$|E[XY^* - X[n]Y^*[n]]| \le \|X - X[n]\| \cdot \|Y\| + \|X[n]\| \cdot \|Y - Y[n]\| \to 0$$
by definition of m.s. convergence and by property (c).
(e) Let $\varepsilon > 0$; then by the Chebyshev inequality we have
$$P[|X - X_1| > \varepsilon] \le E[|X - X_1|^2]/\varepsilon^2 = 0.$$
Since $\varepsilon$ is arbitrary, letting the event $B_n \triangleq \{\zeta \mid |X - X_1| > 1/n\}$, we have $B_n \nearrow
B_\infty = \{X \ne X_1\}$, and so
$$P[X \ne X_1] = 0,$$


by the continuity of P-measure (ref. Theorem 6.1-1). ∎


Another useful property for our work later in this chapter will be the fact that the m.s.
limit is a linear operator.


Theorem 10.1-6 Let $X[n] \to X$ (m.s.) and $Y[n] \to Y$ (m.s.) with both $X$ and $Y$
second-order random variables, that is, their second moments are finite. Then


$$\lim_{n\to\infty} \{aX[n] + bY[n]\} = aX + bY. \quad \text{(m.s.)}$$


Proof We have to show that


$$\|aX[n] + bY[n] - (aX + bY)\| \to 0.$$
By the triangle inequality Equation 10.1-12,
$$\begin{aligned}
\|a(X[n] - X) + b(Y[n] - Y)\| &\le \|a(X[n] - X)\| + \|b(Y[n] - Y)\| \\
&= |a| \cdot \|X[n] - X\| + |b| \cdot \|Y[n] - Y\| \\
&\to 0 \quad \text{as } n \to \infty,
\end{aligned}$$
by assumed m.s. convergence of $X[n]$ and $Y[n]$. ∎


Corollary 10.1-2 The m.s. derivative is a linear operator, that is,


$$\frac{d}{dt}[aX(t) + bY(t)] = aX'(t) + bY'(t). \quad \text{(m.s.)}$$


Proof We leave this as an exercise to the reader.




Example 10.1-7 
(m.s. convergence) Let the random sequence $X[n]$ be given by the second-order random
variables $X_1$ and $X_2$ as $X[n] \triangleq (1 - 1/n)X_1 + (1/n)X_2$. By m.s. convergence theory we
know that the m.s. limit of $X[n]$ as $n \to \infty$ is $X = X_1$. Then by Theorem 10.1-5 property
(a) we have that the mean value of $X$ is $E[X] = \lim E[X[n]] = E[X_1]$, and by property (c)
that the average power of $X$ is $E[|X|^2] = \lim E[|X[n]|^2] = E[|X_1|^2]$.
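A small numerical sketch of this example follows. The sample space is an assumption introduced only for illustration: three equally likely outcomes with the real values of $X_1$ and $X_2$ listed per outcome, so expectations reduce to averages. The mean-square error satisfies $E[|X[n] - X_1|^2] = E[|X_2 - X_1|^2]/n^2$, which tends to zero.

```python
# Assumed toy model: three equally likely outcomes for the pair (X1, X2).
X1 = [1.0, -2.0, 0.5]
X2 = [3.0, 1.0, -1.0]


def mean(values):
    """Expectation over equally likely outcomes."""
    return sum(values) / len(values)


def x_seq(n):
    """Outcome-by-outcome values of X[n] = (1 - 1/n) X1 + (1/n) X2."""
    return [(1.0 - 1.0 / n) * a + (1.0 / n) * b for a, b in zip(X1, X2)]


def ms_error(n):
    """E[|X[n] - X1|^2], which equals E[|X2 - X1|^2] / n^2."""
    return mean([(x - a) ** 2 for x, a in zip(x_seq(n), X1)])


for n in (1, 10, 100, 1000):
    xn = x_seq(n)
    print(n, ms_error(n), mean(xn), mean([x * x for x in xn]))
print("limits:", mean(X1), mean([a * a for a in X1]))
```

The last two columns approach the mean and average power of $X_1$, as properties (a) and (c) of Theorem 10.1-5 predict.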


10.2 MEAN-SQUARE STOCHASTIC INTEGRALS 


In this section we continue our study of the calculus of random processes by considering 
the stochastic integral. Stochastic integrals are important in applications for representing 
linear operators such as convolution, which arise when random processes are passed through 
linear systems. Our discussion will be followed in the next section by a look at stochastic 
differential equations, which of course will involve the stochastic integral in their solution. 
Another application of the stochastic integral is in forming averages for estimating the 
moments and probability functions of stationary processes. This topic is related to ergodic 
theory and will be introduced in Section 10.4. 

We will be interested in the mean-square stochastic integral. It is defined as the mean-square
limit of its defining sum as the partition of the integration interval gets finer and
finer. We first look at the integration of a random process $X(t)$ over a finite interval $(T_1, T_2)$.
The operation of the integral is then just a simple averager. First we create a partition of
$(T_1, T_2)$ consisting of $n$ points using $(t_1, t_2, \ldots, t_n)$. Then the approximate integral is the sum


$$I_n \triangleq \sum_{i=1}^{n} X(t_i)\,\Delta t_i. \tag{10.2-1}$$


On defining the m.s. limit random variable as $I$, we have the following definition of the
mean-square integral.


Definition 10.2-1 The mean-square integral of $X(t)$ over the interval $(T_1, T_2)$ is
denoted $I$. It exists when


$$\lim_{n\to\infty} E\left[\left|I - \sum_{i=1}^{n} X(t_i)\,\Delta t_i\right|^2\right] = 0. \tag{10.2-2}$$
We give $I$ the following symbol:
$$I = \int_{T_1}^{T_2} X(t)\,dt \triangleq \lim_{n\to\infty} I_n \quad \text{(m.s.)} \qquad ∎$$


Because the m.s. limit is a linear operator by Theorem 10.1-6, it follows that the integral
just defined is linear, that is, it obeys


$$\int_{T_1}^{T_2} [aX_1(t) + bX_2(t)]\,dt = a\int_{T_1}^{T_2} X_1(t)\,dt + b\int_{T_1}^{T_2} X_2(t)\,dt,$$


whenever the integrals on the right exist.
To study the existence of the mean-square integral, as before we make use of the Cauchy
criterion for the existence of a limit. Thus, the integral $I$ exists iff $\lim_{m,n\to\infty} E[|I_n - I_m|^2] = 0$.
If we expand this expression into three terms, we get $E[|I_n|^2] - 2\,\mathrm{Re}\{E[I_nI_m^*]\} + E[|I_m|^2]$.
Without loss of generality, we concentrate on the cross-term and evaluate


$$E[I_nI_m^*] = \sum_{i,j} R_{XX}(t_i, t_j)\,\Delta t_i\,\Delta t_j,$$


where the sums range over $1 \le i \le n$ and $1 \le j \le m$. As $m, n \to +\infty$, this converges to the
deterministic integral


$$\int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} R_{XX}(t_1, t_2)\,dt_1\,dt_2, \tag{10.2-3}$$


if this deterministic integral exists. Clearly, the other two terms $E[|I_n|^2]$ and $E[|I_m|^2]$
converge to the same double integral. Thus, we see that the m.s. integral exists whenever
the double integral of Equation 10.2-3 exists in the sense of the ordinary calculus.
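To illustrate the convergence of the double sum, the sketch below evaluates $E[|I_n|^2] = \sum_i\sum_j R_{XX}(t_i, t_j)\,\Delta t_i\,\Delta t_j$ on a uniform partition of $(0, 1)$. The exponential correlation function with $\sigma^2 = \alpha = 1$ and the midpoint sampling are assumptions made for this illustration; for that choice the double integral of Equation 10.2-3 works out to $2e^{-1}$.

```python
import math


def exp_corr(t1, t2, sigma2=1.0, alpha=1.0):
    """Assumed example: R_XX(t1, t2) = sigma^2 * exp(-alpha * |t1 - t2|)."""
    return sigma2 * math.exp(-alpha * abs(t1 - t2))


def power_of_sum(corr, t_lo, t_hi, n):
    """E[|I_n|^2] = sum_i sum_j R_XX(t_i, t_j) dt dt on a uniform midpoint partition."""
    dt = (t_hi - t_lo) / n
    pts = [t_lo + (i + 0.5) * dt for i in range(n)]
    return sum(corr(a, b) for a in pts for b in pts) * dt * dt


exact = 2.0 * math.exp(-1.0)   # double integral over (0, 1)^2 for sigma2 = alpha = 1
for n in (10, 50, 200):
    print(n, power_of_sum(exp_corr, 0.0, 1.0, n), exact)
```

The double sum settles toward the deterministic double integral as the partition is refined, which is the existence condition just derived.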

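The mean and mean-square (power) of $I$ are directly computed (via Theorem 10.1-5) as


$$E[I] = E\left[\int_{T_1}^{T_2} X(t)\,dt\right] = \int_{T_1}^{T_2} E[X(t)]\,dt,$$
and
$$\begin{aligned}
E[|I|^2] &= E\left[\int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} X(t_1)X^*(t_2)\,dt_1\,dt_2\right] \\
&= \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} R_{XX}(t_1, t_2)\,dt_1\,dt_2.
\end{aligned} \tag{10.2-4}$$


The variance of $I$ is given as


$$\sigma_I^2 = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}(t_1, t_2)\,dt_1\,dt_2.$$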


Example 10.2-1 
(integral of white noise over $(0, t]$) Let the random process $X(t)$ be zero mean with covariance
function $K_{XX}(\tau) = \sigma^2\delta(\tau)$ and define the running m.s. integral as


$$Y(t) \triangleq \int_0^t X(s)\,ds, \quad t \ge 0.$$
For fixed $t$, $Y(t)$ is a random variable; when indexed by $t$, $Y(t)$ is a stochastic
process. Its mean is given as $E[Y(t)] = \int_0^t \mu_X(s)\,ds$ and equals 0 since $\mu_X = 0$. The covariance
is calculated for $t_1 \ge 0$, $t_2 \ge 0$ as


$$\begin{aligned}
K_{YY}(t_1, t_2) &= \int_0^{t_1}\!\!\int_0^{t_2} K_{XX}(s_1, s_2)\,ds_1\,ds_2 \\
&= \sigma^2 \int_0^{t_2}\!\!\int_0^{t_1} \delta(s_1 - s_2)\,ds_1\,ds_2 \\
&= \sigma^2 \int_0^{t_2} u(t_1 - s_2)\,ds_2 \\
&= \sigma^2 \int_0^{\min(t_1, t_2)} ds_2 \\
&= \sigma^2 \min(t_1, t_2),
\end{aligned}$$


which we recognize as the covariance of the Wiener process (Section 9.2). Thus, the m.s.
integral of white Gaussian noise is the Wiener process. Note that $Y(t)$ must be Gaussian
if $X(t)$ is Gaussian, since $Y(t)$ is the m.s. limit of a weighted sum of samples of $X(t)$ (see
Problem 10.13).
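A discrete-time analogue gives some feel for this result. The sketch below rests on an assumption that is not in the text: the impulse $\sigma^2\delta(\tau)$ is replaced by uncorrelated samples $X_i$ of variance $\sigma^2/\Delta t$ (a pulse of width $\Delta t$ and area $\sigma^2$), and $Y(t)$ by the running sum $\sum_i X_i\,\Delta t$. The covariance of the running sum is then accumulated term by term, exactly as in the double integral above.

```python
def k_yy_discrete(t1, t2, sigma2=2.0, dt=0.01):
    """Covariance of the running sum Y(t) = sum_i X_i * dt, where the X_i are
    assumed uncorrelated with variance sigma2 / dt (a pulse of area sigma2
    standing in for sigma2 * delta)."""
    n1, n2 = round(t1 / dt), round(t2 / dt)
    total = 0.0
    for i in range(n1):
        for j in range(n2):
            k_xx = sigma2 / dt if i == j else 0.0   # discrete stand-in for the impulse
            total += k_xx * dt * dt
    return total


print(k_yy_discrete(0.3, 0.7), 2.0 * min(0.3, 0.7))
```

Only the diagonal terms $i = j$ contribute, and there are $\min(t_1, t_2)/\Delta t$ of them, which is how the $\min$ function of the Wiener covariance arises.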


We can generalize this integral by including a weighting function $h(t)$ multiplying the
random process $X(t)$. It would thus be the mean-square limit of the following generalization
of Equation 10.2-1:
$$I_n \triangleq \sum_{i=1}^{n} h(t_i)X(t_i)\,\Delta t_i,$$


where the $n$ points $t_1 < t_2 < \ldots < t_n$ are a partition of the interval $(T_1, T_2)$.
This amounts to the following definition for the weighted integral.


Definition 10.2-2 The weighted mean-square integral of $X(t)$ over the interval
$(T_1, T_2)$ is defined by


$$\lim_{n\to\infty} E\left[\left|I - \sum_{i=1}^{n} h(t_i)X(t_i)\,\Delta t_i\right|^2\right] = 0$$


when the limit exists. We give it the following symbol:


$$I \triangleq \int_{T_1}^{T_2} h(t)X(t)\,dt. \qquad ∎$$
Example 10.2-2
(application to linear systems) A linear system $L$ has response $h(t, s)$ at time $t$ to an impulse
applied at time $s$. Then for a deterministic function $x(t)$ as input, we have the output $y(t)$
given as
$$y(t) = L\{x(t)\} = \int_{-\infty}^{+\infty} h(t, s)\,x(s)\,ds, \tag{10.2-5}$$


whenever the foregoing integral exists. If $x(t)$ is bounded and integrable, one condition for
the existence of Equation 10.2-5 would be


$$\int_{-\infty}^{+\infty} |h(t, s)|\,ds < +\infty \quad \text{for all } -\infty < t < +\infty. \tag{10.2-6}$$


We can generalize this integral to an m.s. stochastic integral if the following double integral
exists:


$$\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} h(t_1, s_1)\,h^*(t_2, s_2)\,R_{XX}(s_1, s_2)\,ds_1\,ds_2,$$


in the ordinary calculus. A condition for this, in the case where $R_{XX}$ is bounded and
integrable, is Equation 10.2-6. Given the existence of such a condition, $Y(t)$ exists as an
m.s. limit and defines a random process,


$$Y(t) = \int_{-\infty}^{+\infty} h(t, s)\,X(s)\,ds,$$


whose mean and correlation functions are


$$\mu_Y(t) = \int_{-\infty}^{+\infty} h(t, s)\,\mu_X(s)\,ds,$$


$$R_{YY}(t_1, t_2) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} h(t_1, s_1)\,h^*(t_2, s_2)\,R_{XX}(s_1, s_2)\,ds_1\,ds_2.$$


10.3 MEAN-SQUARE STOCHASTIC DIFFERENTIAL EQUATIONS 
Having introduced stochastic derivatives and integrals, we now turn to the subject of 
stochastic differential equations. The simplest stochastic differential equation is 

$$dY(t)/dt = X(t),$$


where the derivative on the left is an m.s. derivative. The solution turns out to be
$$Y(t) = \int_{t_0}^{t} X(s)\,ds + Y(t_0), \quad t \ge t_0,$$


where the integral is an m.s. integral.
Using the general linear, constant-coefficient differential equation (LCCDE) as a model,
we form the general stochastic LCCDE,


$$a_nY^{(n)}(t) + a_{n-1}Y^{(n-1)}(t) + \ldots + a_0Y(t) = X(t), \tag{10.3-1}$$
for $t \ge 0$ with prescribed initial conditions,
$$Y(0),\; Y^{(1)}(0),\; \ldots,\; Y^{(n-1)}(0),$$


where the equality is in the m.s. sense.
To appreciate the meaning of Equation 10.3-1 more fully, we point out that it is a mean-square
equality at each t separately. Thus, Equation 10.3-1 is an equality at each t with
probability-1 and hence at any countable collection of t’s by property (e) of Theorem 10.1-5. 
However, this does not say that the sample functions of Y(t), which in fact may not even 
be differentiable, satisfy the differential equation driven by the sample functions of X(t). 
The sample function interpretation of Equation 10.3-1 would require that it hold with 
probability-1 for all t > 0, which is an uncountable collection of times. 

We can only think of the m.s. differential equation as an idealization approached in the 
limit, much like the impulse and the ideal lowpass filter. Also, the m.s. differential equation 
must be treated with care because it is not quite what it seems. Just as we would not put an 
impulse into a linear circuit in reality, so we would not use a random process lacking sample 
function derivatives as the input when we simulate a stochastic differential equation. Instead, 
we would use a smoother approximation to the process and if this smoothing is slight, we 
would expect that the idealized solution obtained from our m.s. differential equation would 
have similar properties. This, of course, needs further justification, but for the present we 
will assume that it can be made precise. 

One may raise the question: Why work with such extreme processes, that is, processes 
without sample-function derivatives? The answer is that the analysis of these m.s. differential
equations can proceed very well using the basic methods of linear system analysis. If
we had instead included the extra “smoothing” to guarantee sample-function derivatives, 
then the analysis would be more complicated. For comparison, imagine trying to find the 
exact response of a linear system to a very narrow pulse of area one versus finding the ideal 
impulse response. 

We proceed by finding the mean function $\mu_Y(t)$ and correlation function $R_{YY}(t_1, t_2)$
of the solution $Y(t)$ of the m.s. Equation 10.3-1. Since we have equality with probability-1 for each $t$, we can
compute the expectations of both sides of Equation 10.3-1 and use


$$d^i E[Y(t)]/dt^i = E[Y^{(i)}(t)],$$


to obtain
$$a_n\mu_Y^{(n)}(t) + a_{n-1}\mu_Y^{(n-1)}(t) + \ldots + a_0\mu_Y(t) = \mu_X(t) \tag{10.3-2}$$


with prescribed initial conditions at $t = 0$,
$$\mu_Y^{(i)}(0) = E[Y^{(i)}(0)] \quad \text{for } i = 0, 1, \ldots, n - 1.$$


Thus, the mean function of $Y(t)$ is the solution to this linear differential equation, whose
input is the mean function of $X(t)$. So knowledge of $\mu_X(t)$ is sufficient to determine $\mu_Y(t)$,
that is, we do not have to know any of the higher moment functions of $X(t)$. Note that
this would not be true if we were considering nonlinear differential equations. However,
essentially no change would be necessary to accommodate time-varying coefficients, but of
course the resulting equations would be much harder to solve. Thus, we will stay with the
constant-coefficient case. Parenthetically, we note that if $\mu_X(t) = 0$ for all $t \ge 0$ and the
initial conditions are zero, then clearly $\mu_Y(t) = 0$ for all $t \ge 0$.
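Because Equation 10.3-2 is an ordinary deterministic LCCDE, any standard solver applies. The sketch below treats an assumed first-order case, $a_1\mu_Y'(t) + a_0\mu_Y(t) = \mu_X(t)$ with $\mu_Y(0) = 0$ and a constant input mean, using a plain forward-Euler step. The coefficient values and the function name `mean_response` are illustrative assumptions, and the closed form used for comparison is the standard solution for a constant input.

```python
import math


def mean_response(a1, a0, mu_x, t_end, dt=1e-4):
    """Forward-Euler solution of a1 * mu_Y'(t) + a0 * mu_Y(t) = mu_x(t), mu_Y(0) = 0."""
    mu_y, t = 0.0, 0.0
    steps = round(t_end / dt)
    for _ in range(steps):
        mu_y += dt * (mu_x(t) - a0 * mu_y) / a1
        t += dt
    return mu_y


a1, a0, level = 1.0, 2.0, 3.0            # assumed illustrative values
numeric = mean_response(a1, a0, lambda t: level, t_end=2.0)
closed = (level / a0) * (1.0 - math.exp(-a0 * 2.0 / a1))
print(numeric, closed)
```

Only the input mean enters the computation, which reflects the point made above: no higher moments of $X(t)$ are needed to find $\mu_Y(t)$ in the linear case.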

Next we determine the cross-correlation function $R_{XY}(t_1, t_2)$, which is the correlation
between the input to Equation 10.3-1 at time $t_1$ and the output at time $t_2$. This quantity can
be useful in system identification studies. We will assume for simplicity that the coefficients
$a_i$ are real. Then we substitute $t_2$ for $t$ in Equation 10.3-1, conjugate, and multiply both
sides by $X(t_1)$ to obtain


$$X(t_1)\sum_{i=0}^{n} a_iY^{(i)*}(t_2) = X(t_1)X^*(t_2), \quad t_1 \ge 0,\; t_2 \ge 0,$$


which holds with probability-1 by Theorem 10.1-5, property (e). Taking expectations we
obtain


$$\sum_{i=0}^{n} a_i E[X(t_1)Y^{(i)*}(t_2)] = E[X(t_1)X^*(t_2)],$$


or


$$\sum_{i=0}^{n} a_i R_{XY^{(i)}}(t_1, t_2) = R_{XX}(t_1, t_2),$$


which, using $R_{XY^{(i)}} = \partial^i R_{XY}/\partial t_2^i$, is the same as


$$\sum_{i=0}^{n} a_i\,\partial^i R_{XY}(t_1, t_2)/\partial t_2^i = R_{XX}(t_1, t_2), \quad t_2 \ge 0, \tag{10.3-3}$$


for each $t_1 \ge 0$, subject to the initial conditions
$\partial^i R_{XY}(t_1, 0)/\partial t_2^i$ for $i = 0, 1, \ldots, n - 1$.


To obtain a differential equation for $R_{YY}(t_1, t_2)$, we multiply both sides of Equation 10.3-1
by $Y^*(t_2)$ and similarly obtain, for each $t_2 \ge 0$,


$$\sum_{i=0}^{n} a_i\,\partial^i R_{YY}(t_1, t_2)/\partial t_1^i = R_{XY}(t_1, t_2) \quad \text{for } t_1 \ge 0, \tag{10.3-4}$$


with initial conditions $\partial^i R_{YY}(0, t_2)/\partial t_1^i$ for $i = 0, 1, \ldots, n - 1$.

One can obtain equations identical to Equations 10.3-3 and 10.3-4 for the covariance
functions $K_{XY}$ and $K_{YY}$ by noting that Equation 10.3-2 can be used to center
Equation 10.3-1 at the means of $X$ and $Y$. This follows from the linearity of Equation 10.3-1
and converts it to an m.s. differential equation in the centered processes $Y_c$ and $X_c$. This
then yields Equations 10.3-3 and 10.3-4 for the covariance functions.

We now turn to the solution of these partial differential equations. We will solve
Equation 10.3-3 first, followed by Equation 10.3-4. We also note that Equation 10.3-3 is
not really a partial differential equation since $t_1$ just plays the role of a constant parameter.
Thus, we must first solve Equation 10.3-3, an LCCDE in time variable $t_2$, for each
$t_1$, thereby obtaining the cross-correlation $R_{XY}$. Then we use this function as input to
Equation 10.3-4, which in turn is solved in time variable $t_1$, for each value of the parameter
$t_2$. What remains is the problem of obtaining the appropriate initial conditions for these
two deterministic LCCDEs from the given stochastic LCCDE Equation 10.3-1. This is illustrated
by the following example, which also shows the formal advantages of working with
the idealization called white noise.


Example 10.3-1 
(first-order m.s. differential equation) Let $X(t)$ be a stationary random process with mean
$\mu_X$ and covariance function $K_{XX}(\tau) = \sigma^2 \delta(\tau)$.† Let $Y(t)$ be the solution to the differential
equation


dY(t)/dt+aY(t)=X(t),  t>0, (10.3-5) 


subject to the initial condition $Y(0) = 0$. We assume $a > 0$ for stability. Then the mean $\mu_Y$
is the solution to the first-order differential equation

$$ \mu_Y'(t) + a \mu_Y(t) = \mu_X, \quad t \ge 0, $$

subject to $\mu_Y(0) = 0$. This initial condition comes from the fact that the initial random
variable $Y(0)$ equals the constant 0 and therefore $E[Y(0)] = 0$. The solution is then easily
obtained as

$$ \mu_Y(t) = (\mu_X / a)(1 - \exp(-at)) \quad \text{for } t \ge 0. $$
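As a quick numerical check of this mean function, the sketch below integrates the first-order mean equation with a classical fourth-order Runge-Kutta step and compares the result with the closed form. The parameter values `MU_X` and `A` and the helper names are illustrative assumptions, not values taken from the text.

```python
import math

def mean_closed_form(t, mu_x, a):
    """Closed-form mean of Y(t) from the example: (mu_X / a) * (1 - exp(-a t))."""
    return (mu_x / a) * (1.0 - math.exp(-a * t))

def mean_by_rk4(t_end, mu_x, a, steps=1000):
    """Integrate mu' = mu_X - a * mu from mu(0) = 0 with classical RK4."""
    h = t_end / steps
    mu = 0.0
    rhs = lambda m: mu_x - a * m  # autonomous right-hand side (mu_X is constant)
    for _ in range(steps):
        k1 = rhs(mu)
        k2 = rhs(mu + 0.5 * h * k1)
        k3 = rhs(mu + 0.5 * h * k2)
        k4 = rhs(mu + h * k3)
        mu += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return mu

MU_X, A = 2.0, 0.7  # hypothetical input mean and pole location
numeric = mean_by_rk4(5.0, MU_X, A)
exact = mean_closed_form(5.0, MU_X, A)
print(numeric, exact)
```

For large $t$ both values approach $\mu_X / a$, the steady-state mean of the output.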
Next we use the covariance version of Equation 10.3-3 specialized to the first-order differential
Equation 10.3-5 to obtain the cross-covariance differential equation

$$ \frac{\partial K_{XY}(t_1,t_2)}{\partial t_2} + a K_{XY}(t_1,t_2) = \sigma^2 \delta(t_1 - t_2), $$

to be solved for $t_2 > 0$ subject to the initial condition $K_{XY}(t_1,0) = 0$, which follows from
$Y(0) = 0$. For $0 < t_2 < t_1$, the solution is just 0 since the input is zero for $t_2 < t_1$. For the
interval $t_2 > t_1$, we get the delayed impulse response $\sigma^2 \exp(-a(t_2 - t_1))$ since the input is
a delayed impulse occurring at $t_2 = t_1$. Thus, the overall solution is

$$ K_{XY}(t_1,t_2) = \begin{cases} 0, & 0 < t_2 < t_1, \\ \sigma^2 \exp(-a(t_2 - t_1)), & t_2 > t_1. \end{cases} $$

This cross-covariance function is plotted in Figure 10.3-1 for $a = 0.7$ and $\sigma^2 = 3$.
Next we obtain the differential equation for the output covariance $K_{YY}$ by specializing
the covariance version of Equation 10.3-4 to the first-order m.s. differential Equation 10.3-5:

$$ \frac{\partial K_{YY}(t_1,t_2)}{\partial t_1} + a K_{YY}(t_1,t_2) = K_{XY}(t_1,t_2), $$

subject to the initial condition at $t_1 = 0$ (for each $t_2$):

$$ K_{YY}(0,t_2) = 0. $$

For the interval $0 < t_1 < t_2$,

$$ K_{XY}(t_1,t_2) = \sigma^2 \exp(-a(t_2 - t_1)), $$


†Note that the parameter $\sigma^2$ used here is not the variance of this white noise process. In fact, $K_{XX}(0) = \infty$
for white noise.
Figure 10.3-1 Cross-covariance $K_{XY}(t_1,t_2)$ plotted versus output time $t_2$ for the fixed value $t_1 = 2$.


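so

$$ \frac{\partial K_{YY}(t_1,t_2)}{\partial t_1} + a K_{YY}(t_1,t_2) = \sigma^2 \exp(-a(t_2 - t_1)), $$

which has solution

$$ K_{YY}(t_1,t_2) = \frac{\sigma^2}{2a} e^{-a t_2}\left(e^{a t_1} - e^{-a t_1}\right). $$

For $t_1 > t_2$, $K_{XY}(t_1,t_2) = 0$, so we then have to solve

$$ \frac{\partial K_{YY}(t_1,t_2)}{\partial t_1} + a K_{YY}(t_1,t_2) = 0, $$

subject to $K_{YY}(t_2,t_2) = (\sigma^2/2a)[1 - \exp(-2at_2)]$. We obtain

$$ K_{YY}(t_1,t_2) = \frac{\sigma^2}{2a}\left(1 - \exp(-2at_2)\right)\exp(-a(t_1 - t_2)), \quad t_1 > t_2. $$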
The overall function is plotted versus $t_1$ in Figure 10.3-2 for the same $a$ and $\sigma^2$ as in
Figure 10.3-1.
We note that as $t_1$ and $t_2 \to +\infty$, the variance of $Y(t)$ tends to a constant, that is,
$K_{YY}(t,t) \to \sigma^2/2a$. In fact with $t_1 = t + \tau$ and $t_2 = t$, the covariance of $Y(t)$ becomes

$$ K_{YY}(t+\tau,t) = \begin{cases} \dfrac{\sigma^2}{2a}\left(1 - \exp(-2at)\right)\exp(-a\tau), & \tau \ge 0, \\[1ex] \dfrac{\sigma^2}{2a}\left(\exp(a\tau) - e^{-2at}\exp(-a\tau)\right), & \tau < 0. \end{cases} $$

Now if we let $t \to +\infty$ for any fixed value of $\tau$, we obtain

$$ K_{YY}(\tau) = (\sigma^2/2a)\exp(-a|\tau|). $$
Figure 10.3-2 Plot of output covariance $K_{YY}(t_1,t_2)$ versus output time $t_1$ with fixed $t_2 = 4$.


This is an example of what is called asymptotic wide-sense stationarity. It happens here 
because the input random process is WSS and the LCCDE is stable. In fact the only thing 
creating the nonstationarity is the zero initial condition, the effect of which decays away 
with time due to the stability assumption. 
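The decay of the initial-condition effect can be seen numerically. The sketch below codes the output covariance of Example 10.3-1, using the observation that both branches of the solution reduce to the variance at the earlier time multiplied by an exponential decay over the time gap, and compares $K_{YY}(t+\tau,t)$ with the limiting form for growing $t$. The function names and the chosen lag `TAU` are illustrative assumptions.

```python
import math

def k_yy(t1, t2, a, sigma2):
    """Output covariance K_YY(t1, t2) of the first-order example with Y(0) = 0."""
    early, late = min(t1, t2), max(t1, t2)
    # Variance at the earlier time, then exponential decay over the gap.
    variance_early = (sigma2 / (2.0 * a)) * (1.0 - math.exp(-2.0 * a * early))
    return variance_early * math.exp(-a * (late - early))

def k_yy_limit(tau, a, sigma2):
    """Asymptotic covariance (sigma^2 / 2a) * exp(-a |tau|)."""
    return (sigma2 / (2.0 * a)) * math.exp(-a * abs(tau))

A, SIGMA2 = 0.7, 3.0   # the values used for Figures 10.3-1 and 10.3-2
TAU = 1.5              # hypothetical fixed lag
gaps = [abs(k_yy(t + TAU, t, A, SIGMA2) - k_yy_limit(TAU, A, SIGMA2))
        for t in (1.0, 5.0, 20.0)]
print(gaps)
```

The gap shrinks like $e^{-2at}$, so the covariance depends only on the lag once $t$ is a few time constants past the origin.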

The reader may wonder at this point whether the random process $Y(t)$ with correlation
$R_{YY}$ and cross-correlation $R_{XY}$ given by Equations 10.3-3 and 10.3-4 actually satisfies
Equation 10.3-1 in the m.s. sense. This necessary question is taken up in Problem 10.18 at
the end of this chapter.


10.4 ERGODICITY [10-3] 


Until now we have generally assumed that a statistical description of the random process 
is available. Of course, this is seldom true in practice; thus, we must develop some way of 
learning the needed statistical quantities from the observed sample functions of the random 
processes of interest. Fortunately, for many stationary random processes, we can substitute 
time averages for the unknown ensemble averages. We can use the stochastic integral defined 
in Section 10.2 to form a time average of the random process X(t) over the interval [—T, T], 


$$ \frac{1}{2T} \int_{-T}^{+T} X(t)\,dt. $$

In many cases this average will tend to the ensemble mean as $T$ goes to infinity. When this
happens for a random process, we say the random process is ergodic in the mean. Other
types of ergodicity can be defined, such as ergodic in power or ergodic in correlation function
or ergodic in the CDF. Each type of ergodicity means that the corresponding time average
converges to the named ensemble average.

This gives us a way of learning the probability and moment functions, which can then 
be used to statistically characterize the random processes of interest. To study convergence 
of the respective time averages and hence determine whether the required ergodicity holds, 
we need to decide on a type of convergence for the random variables in question. For most 
of our work we will adopt the relatively simple concept of mean-square convergence. For 
example, we might say that the process X(t) is mean-square ergodic in both the mean 
function and the covariance function. 

The property of mean-square ergodicity occurs when the random process decorrelates 
sufficiently rapidly with time shift so that the time average in question looks like the average 
of many almost uncorrelated random variables, which in turn—by appropriate forms of the 
weak Law of Large Numbers—will converge to the appropriate ensemble average. That is, 
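we can write upon setting $\Delta T = T/N$,

$$ \frac{1}{2T}\int_{-T}^{+T} X(t)\,dt = \frac{1}{2N\Delta T}\int_{-N\Delta T}^{+N\Delta T} X(t)\,dt = \frac{1}{2N}\sum_{n=-N}^{+(N-1)} \left(\frac{1}{\Delta T}\int_{n\Delta T}^{(n+1)\Delta T} X(t)\,dt\right), $$

where the terms in the sum are approximately uncorrelated if $\Delta T$ is large enough. If the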
random process stays highly correlated over arbitrarily long time intervals, then we would 
not expect such behavior, and indeed ergodicity would not hold. Two simple examples of 
nonergodicity are 


(1) $X(t) = A$ where $A$ is a random variable;
(2) $X(t) = A\cos 2\pi f t + B\sin 2\pi f t$, where $A$ and $B$ are random variables with
$E[A] = E[B] = 0$, $E[A^2] = E[B^2] > 0$, and $E[AB] = 0$.


Example 1 is clearly not ergodic because any time average of X(t) is just A, a random 
variable. Thus, there is no convergence to the ensemble mean E[A]. Example 2 can be shown 
to be WSS and ergodic in the mean but is not ergodic in power. 
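The behavior of Example 2 can be illustrated on a single sample function. The sketch below fixes one hypothetical realization of $(A, B)$ and approximates the time averages of $X(t)$ and $X^2(t)$ with a midpoint rule. The time average of $X(t)$ is close to the ensemble mean 0, while the time average of $X^2(t)$ is close to $(A^2 + B^2)/2$, which depends on the realization and so cannot equal the ensemble power $E[A^2]$ in general. The numerical values and helper names are assumptions made for illustration.

```python
import math

def time_average(g, half_width, steps=100_000):
    """Midpoint-rule approximation of (1/2T) * integral of g(t) over [-T, T]."""
    h = 2.0 * half_width / steps
    total = sum(g(-half_width + (k + 0.5) * h) for k in range(steps))
    return total * h / (2.0 * half_width)

F = 1.0                     # frequency in Hz (illustrative)
a_val, b_val = 1.3, -0.4    # one hypothetical realization of (A, B)

def x(t):
    """Sample function X(t) = A cos(2 pi f t) + B sin(2 pi f t) for the fixed draw."""
    return a_val * math.cos(2.0 * math.pi * F * t) + b_val * math.sin(2.0 * math.pi * F * t)

T = 250.25                  # deliberately not a whole number of periods
mean_avg = time_average(x, T)
power_avg = time_average(lambda t: x(t) ** 2, T)
realization_power = (a_val ** 2 + b_val ** 2) / 2.0
print(mean_avg, power_avg, realization_power)
```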


Definition 10.4-1 A WSS random process is ergodic in the mean if the time average
of $X(t)$ converges to the ensemble average $E[X(t)] = \mu_X$, that is,

$$ M(T) \triangleq \frac{1}{2T}\int_{-T}^{+T} X(t)\,dt \to \mu_X \quad \text{(m.s.) as } T \to \infty. \qquad \blacksquare $$


In the above equation we observe that the time average M is a random variable. Hence 
we can compute its mean and variance using the theory of m.s. integrals obtaining 


$$ E[M] = \frac{1}{2T}\int_{-T}^{+T} E[X(t)]\,dt = \mu_X, \tag{10.4-1} $$

$$ \sigma_M^2 = \frac{1}{(2T)^2}\int_{-T}^{+T}\int_{-T}^{+T} K_{XX}(t_1 - t_2)\,dt_1\,dt_2. \tag{10.4-2} $$

Figure 10.4-1 Square region in $t_1, t_2$ plane for integral in Equation 10.4-3.

The mean of $M$ is thus the ensemble mean of $X(t)$, so if the variance is small the estimate
will be good. Estimates that have the correct mean value are said to be unbiased (cf.
Definition 5.8-2). Noting the mean value from Equation 10.4-1 we see that

$$ \sigma_M^2 = E\left[|M - \mu_X|^2\right]; $$

thus, the convergence of the integral in Equation 10.4-2 to zero is the same as the convergence
of $M$ to $\mu_X$ in the m.s. sense. Since we will mostly deal with m.s. ergodicity, we will
omit its mention with the understanding that unless otherwise stated time averages are
computed in the m.s. sense.

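To evaluate the integral in Equation 10.4-2 we look at Figure 10.4-1, which shows the
area over which the integration is performed. Realizing that the random process $X(t)$ is
WSS and the covariance function is thus only a function of the difference between $t_1$ and
$t_2$, we can make the following change of variables that simplifies the integral:

$$ s = t_1 + t_2, \quad \tau = t_1 - t_2, \quad \text{with Jacobian } |J| = \mathrm{abs}\begin{vmatrix} \partial s/\partial t_1 & \partial s/\partial t_2 \\ \partial \tau/\partial t_1 & \partial \tau/\partial t_2 \end{vmatrix} = \mathrm{abs}\begin{vmatrix} 1 & 1 \\ 1 & -1 \end{vmatrix} = |-2| = 2, $$

so $dt_1\,dt_2 = |J|^{-1}\,ds\,d\tau = \frac{1}{2}\,ds\,d\tau$, and Equation 10.4-2 becomes

$$ \frac{1}{(2T)^2}\int_{-2T}^{+2T}\left[\int_{-(2T-|\tau|)}^{+(2T-|\tau|)} \frac{1}{2} K_{XX}(\tau)\,ds\right] d\tau = \frac{1}{(2T)^2}\int_{-2T}^{+2T} K_{XX}(\tau)(2T - |\tau|)\,d\tau. \tag{10.4-3} $$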


Thus, we arrive at the equivalent condition for the ergodicity in the mean of a WSS random 
process: 


$$ \frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K_{XX}(\tau)\,d\tau \to 0 \quad \text{as } T \to \infty. $$

Note: A sufficient condition for ergodicity in the mean is the stronger condition

$$ \int_{-\infty}^{+\infty} |K_{XX}(\tau)|\,d\tau < \infty, $$

that is, that the covariance function be absolutely integrable.


Theorem 10.4-1 A WSS random process $X(t)$ is ergodic in the mean iff its covariance
function $K_{XX}(\tau)$ satisfies

$$ \lim_{T\to\infty}\left[\frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K_{XX}(\tau)\,d\tau\right] = 0. $$

We note that this is the same as saying that the estimate $M$ converges to $\mu_X$ in the mean-square
sense, hence the name m.s. ergodicity. Since m.s. convergence also implies convergence
in probability, we have that $M$ also converges to the ensemble mean $\mu_X$ in probability.
This is analogous to the weak Law of Large Numbers for uncorrelated random sequences as
studied in Section 9.6.


We can also define ergodicity in the higher moments, for example, mean square or 
power. 


Definition 10.4-2 A WSS random process $X(t)$ is ergodic in mean square if

$$ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{+T} |X(t)|^2\,dt = R_{XX}(0) \quad \text{(m.s.)}. $$


Similarly, we can define ergodicity in correlation and covariance. 


Definition 10.4-3 A WSS random process $X(t)$ is ergodic in correlation for shift
(lag) $\lambda$ iff

$$ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{+T} X(t+\lambda)X^*(t)\,dt = R_{XX}(\lambda) \quad \text{(m.s.)}. $$

If this condition is true for all $\lambda$, we say $X$ is ergodic in correlation. The conditions for
the preceding two types of ergodicities are covered by the following theorem on ergodicity
in correlation, where we have defined the random process for each $\lambda$,

$$ \Phi_\lambda(t) \triangleq X(t+\lambda)X^*(t). $$


Theorem 10.4-2 The WSS random process $X(t)$ is ergodic in correlation at the shift
$\lambda$ iff

$$ \lim_{T\to\infty}\left[\frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K_{\Phi_\lambda\Phi_\lambda}(\tau)\,d\tau\right] = 0. \tag{10.4-4} $$

Proof The time-average estimate of the correlation function found in Definition
10.4-3 is just the time average of the random process $\Phi_\lambda(t)$ for each fixed $\lambda$. Thus, we
can apply Theorem 10.4-1 to the ergodicity in the mean problem for $\Phi_\lambda(t)$ as long as this
process is WSS, that is, $X(t)$ has stationary second and fourth moments. The preceding
condition is then seen to be the same as in Theorem 10.4-1 with the substitution of the
covariance function $K_{\Phi_\lambda\Phi_\lambda}(\tau)$ for $K_{XX}(\tau)$. ■


Note that

$$ K_{\Phi_\lambda\Phi_\lambda}(\tau) = R_{\Phi_\lambda\Phi_\lambda}(\tau) - |R_{XX}(\lambda)|^2, $$

where

$$ R_{\Phi_\lambda\Phi_\lambda}(\tau) = E[\Phi_\lambda(t+\tau)\Phi_\lambda^*(t)] = E[X(t+\tau+\lambda)X^*(t+\tau)X^*(t+\lambda)X(t)], $$

which shows explicitly that $X(t)$ must have the fourth-moment stationarity here denoted
for this theorem to apply. Some examples of WSS random processes that are ergodic in
various senses follow.


Example 10.4-1 
(random cosine) Consider the WSS process of Example 9.1-5, which is a random amplitude
cosine at frequency $f_0$ with random phase,

$$ X(t) = A\cos(2\pi f_0 t + \Theta), \quad -\infty < t < +\infty. $$

Here $A$ is $N(0,1)$, $\Theta$ is uniformly distributed over $[-\pi,+\pi]$, and $A$ and $\Theta$ are
independent. Then

$$ E[X(t)] = E[A]\,E[\cos(2\pi f_0 t + \Theta)] = 0 $$

and

$$ E[X(t+\tau)X(t)] = E[A^2]\,\frac{1}{2\pi}\int_{-\pi}^{+\pi}\cos(2\pi f_0 t + 2\pi f_0\tau + \theta)\cos(2\pi f_0 t + \theta)\,d\theta = \frac{1}{2}\cos(2\pi f_0\tau), \quad -\infty < \tau < +\infty, $$

so that $X(t)$ is indeed WSS. (In fact it can be shown that the process is also strict-sense
stationary.) We first inquire whether $X(t)$ is ergodic in the mean; hence we compute

$$ \sigma_M^2 = \frac{1}{2T}\int_{-2T}^{+2T}\left[1 - \frac{|\tau|}{2T}\right]\frac{1}{2}\cos(2\pi f_0\tau)\,d\tau. $$

If we realize that the triangular term in the square brackets can be expressed as the convolution
of two rectangular pulses, as shown in Figure 10.4-2, then we can write the Fourier
transform of the triangular pulse as the square of the transform of one rectangular pulse of
half the width:

$$ \left(\sqrt{2T}\,\frac{\sin 2\pi fT}{2\pi fT}\right)^2 = 2T\left(\frac{\sin 2\pi fT}{2\pi fT}\right)^2, $$
Figure 10.4-2 Illustration of the convolution of two rectangular pulses.


and then use Parseval's theorem [10-4] to evaluate the above variance as

$$ \frac{1}{2T}\int_{-2T}^{+2T}\left[1 - \frac{|\tau|}{2T}\right]\frac{1}{2}\cos 2\pi f_0\tau\,d\tau = \frac{1}{2T}\int_{-\infty}^{+\infty} 2T\left(\frac{\sin 2\pi fT}{2\pi fT}\right)^2 (1/4)[\delta(f+f_0)+\delta(f-f_0)]\,df. $$

Thus, we have

$$ \sigma_M^2 = \frac{1}{2}\left(\frac{\sin 2\pi f_0 T}{2\pi f_0 T}\right)^2 \to 0 \quad \text{as } T \to \infty \text{ for } f_0 \ne 0. $$

Hence this random process is ergodic in the mean for any $f_0 \ne 0$.
To determine whether it is ergodic in power, we could use the condition of Equation
10.4-3. However, in this simple case we can obtain the result by examining the time average
directly. Thus,

$$ \frac{1}{2T}\int_{-T}^{+T} X^2(t)\,dt = A^2\,\frac{1}{2T}\int_{-T}^{+T}\cos^2(2\pi f_0 t + \Theta)\,dt. $$

Clearly, this time average will converge to $A^2/2$, not to $E[A^2]/2$, since for any $\Theta$, the time
average of the cosine squared will converge to $1/2$ for $f_0 \ne 0$. Thus, this random cosine is
not ergodic in power and hence not ergodic in correlation either. This is not unexpected
since $K_{XX}(\tau)$ does not tend to zero as $|\tau|$ tends to infinity. ($K_{XX}(\tau)$ is in fact periodic!)
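The closed-form variance of the time average in this example can be cross-checked by direct numerical integration of the triangularly weighted covariance, without going through Parseval's theorem. The sketch below does this with a midpoint rule; the frequency `F0`, the half-width `T`, and the function names are illustrative assumptions.

```python
import math

def variance_of_time_average(cov, half_width, steps=100_000):
    """Midpoint rule for (1/2T) * int_{-2T}^{2T} (1 - |tau|/2T) cov(tau) dtau."""
    two_t = 2.0 * half_width
    h = 2.0 * two_t / steps
    total = 0.0
    for k in range(steps):
        tau = -two_t + (k + 0.5) * h
        total += (1.0 - abs(tau) / two_t) * cov(tau)
    return total * h / two_t

F0 = 2.0  # hypothetical cosine frequency in Hz

def cov_random_cosine(tau):
    """Covariance of the random cosine: (1/2) cos(2 pi f0 tau)."""
    return 0.5 * math.cos(2.0 * math.pi * F0 * tau)

def closed_form(half_width):
    """(1/2) * (sin(2 pi f0 T) / (2 pi f0 T))^2 from the example."""
    arg = 2.0 * math.pi * F0 * half_width
    return 0.5 * (math.sin(arg) / arg) ** 2

T = 3.1
numeric = variance_of_time_average(cov_random_cosine, T)
print(numeric, closed_form(T))
```

Although the covariance is periodic and never decays, the weighted integral still tends to zero, which is why ergodicity in the mean holds here while ergodicity in power fails.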


Another useful type of ergodicity is ergodicity in distribution function. Here we consider
using time averages to estimate the distribution function of some random process $X$, which
is at least stationary to first order; that is, the first-order CDF is shift-invariant. We can
form such a time average by first forming the indicator process $I_x(t)$,

$$ I_x(t) \triangleq \begin{cases} 1, & \text{if } X(t) \le x, \\ 0, & \text{else.} \end{cases} $$

Thus, the random process $I_x(t)$ for each fixed $x$ is one if the event $\{X(t) \le x\}$ occurs and
zero if the event $\{X(t) \le x\}$ does not occur. The function $I_x(t)$ thus "indicates" the event in
the sense of Boolean logic. Since $I_x(t)$ is a function of the random process $X(t)$, it in turn is
a random process. The time average value of $I_x(t)$ can be used to estimate the distribution
function $F_X(x;t) = P[X(t) \le x]$ as

$$ \hat{F}_X(x) \triangleq \frac{1}{2T}\int_{-T}^{+T} I_x(t)\,dt. \tag{10.4-5} $$
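To make the estimate of Equation 10.4-5 concrete, the sketch below applies it to one sample function of a unit-amplitude cosine with a fixed phase. This process is an assumption chosen for illustration and is not the random-amplitude cosine of Example 10.4-1. For a unit-amplitude cosine with uniformly distributed phase, the first-order CDF is $F_X(x) = 1 - \arccos(x)/\pi$ for $|x| \le 1$, and the fraction of time a sample function spends at or below $x$ is the same quantity, so the time-average estimate should reproduce it.

```python
import math

def cdf_time_average(path, level, half_width, steps=100_000):
    """Midpoint-rule estimate of (1/2T) * int_{-T}^{T} I_x(t) dt,
    where I_x(t) = 1 if path(t) <= level and 0 otherwise."""
    h = 2.0 * half_width / steps
    hits = sum(1 for k in range(steps)
               if path(-half_width + (k + 0.5) * h) <= level)
    return hits / steps

THETA = 0.9  # hypothetical realization of the random phase

def path(t):
    """One sample function: cos(2 pi t + theta) with the phase held fixed."""
    return math.cos(2.0 * math.pi * t + THETA)

T = 50.0
levels = (-0.5, 0.0, 0.5)
estimates = [cdf_time_average(path, x, T) for x in levels]
targets = [1.0 - math.acos(x) / math.pi for x in levels]
print(estimates, targets)
```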


First we consider the mean of this estimate. It is directly seen that

$$ E[\hat{F}_X(x)] = E[I_x(t)] = 1 \cdot P[X(t) \le x] = F_X(x;t). $$

Next we consider the correlation function of $I_x(t)$,

$$ E[I_x(t_1) I_x(t_2)] = P[X(t_1) \le x, X(t_2) \le x] = F_X(x,x;t_1,t_2) = F_X(x,x;t_1 - t_2), $$

where the last equality follows if $X$ is stationary of order 2.† Thus, we can say that $I_x(t)$ will
be a WSS random process iff $X(t)$ is stationary of order 2. In this case we can apply
Theorem 10.4-1 to $I_x(t)$ to get the following result.


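Theorem 10.4-3 A random process $X(t)$, stationary up to order 2, is ergodic in
distribution iff

$$ \lim_{T\to\infty}\left[\frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K_{I_x I_x}(\tau)\,d\tau\right] = 0, $$

where

$$ K_{I_x I_x}(\tau) = F_X(x,x;\tau) - F_X^2(x). $$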

Thus, $K_{I_x I_x}(\tau)$ must generally decay to zero as $|\tau| \to +\infty$ for the foregoing condition to be
met; that is, we generally need

$$ F_X(x,x;\tau) \to F_X^2(x) \quad \text{as } |\tau| \to +\infty. \tag{10.4-6} $$

This would be saying that $X(t+\tau)$ and $X(t)$ are asymptotically independent as
$|\tau| \to +\infty$.


Example 10.4-2 
(testing ergodicity) Let $X(t)$ be a WSS random process with covariance function

$$ K_{XX}(\tau) = \sigma_X^2 \exp(-\alpha|\tau|). $$

Then $X(t)$ is ergodic in the mean since $K_{XX}(\tau)$ is absolutely integrable. If we further assume
that $X(t)$ is a Gaussian random process, then using the Gaussian fourth-order moment
property (cf. Problem 5.34), we can show ergodicity in correlation and hence in power
and covariance. Also, again invoking the Gaussian property, we can conclude ergodicity in
distribution since the Gaussian CDF is a continuous function of its mean and covariance so
that Equation 10.4-6 is satisfied.
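For the exponential covariance of this example, the quantity in Theorem 10.4-1 can be evaluated in closed form by direct integration, giving $(\sigma_X^2/T)\left[1/\alpha - (1 - e^{-2\alpha T})/(2T\alpha^2)\right]$. This expression is a worked evaluation added here and is not quoted from the text. The sketch below tabulates it for growing $T$; the parameter values are illustrative assumptions.

```python
import math

def variance_of_time_average(half_width, sigma2, alpha):
    """(1/2T) * int_{-2T}^{2T} (1 - |tau|/2T) * sigma2 * exp(-alpha |tau|) dtau,
    evaluated in closed form."""
    t = half_width
    decay_term = (1.0 - math.exp(-2.0 * alpha * t)) / (2.0 * t * alpha ** 2)
    return (sigma2 / t) * (1.0 / alpha - decay_term)

SIGMA2, ALPHA = 3.0, 0.7  # hypothetical variance and decay rate
half_widths = (1.0, 10.0, 100.0, 1000.0)
values = [variance_of_time_average(t, SIGMA2, ALPHA) for t in half_widths]
for t, v in zip(half_widths, values):
    print(t, v)
```

For large $T$ the variance behaves like $\sigma_X^2/(\alpha T)$, so the time average converges in the m.s. sense at a rate set by the correlation time $1/\alpha$.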


†We recall that "stationary of order 2" means that the second-order CDFs and pdf's are time-invariant;
that is, they only depend on the difference of the two times $t_1$ and $t_2$.
We note that the three theorems of this section give the same kind of condition for the
three types of ergodicity considered. The difference between them lies in the three different
covariance functions. The general form of the condition is

$$ \lim_{T\to\infty}\frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K(\tau)\,d\tau = 0. \tag{10.4-7} $$

We now present an equivalent simpler condition for the case when $K(\tau)$ has a limit as
$|\tau| \to +\infty$.


Theorem 10.4-4 If a covariance $K(\tau)$ has a limit as $|\tau| \to +\infty$, then Equation 10.4-7
is equivalent to $\lim_{|\tau|\to\infty} K(\tau) = 0$.


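Proof If $K(\tau)$ tends to a nonzero value, then clearly Equation 10.4-7 will not hold.
So assume $\lim_{|\tau|\to\infty} K(\tau) = 0$; then for $|\tau|$ large enough, say $|\tau| > T_0$, $|K(\tau)| < \epsilon$ so that

$$ \left|\frac{1}{2T}\int_{-2T}^{+2T}\left(1 - \frac{|\tau|}{2T}\right) K(\tau)\,d\tau\right| < \frac{1}{2T}\left[4T\epsilon + 2MT_0\right], \quad \text{where } M < \infty, $$

$$ \to 2\epsilon \quad \text{as } T \to \infty, $$

which was to be shown, since $\epsilon$ is arbitrary. Here, $M$ is a bound on $|K(\tau)|$ over $[-T_0, T_0]$. ■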

10.5 KARHUNEN-LOEVE EXPANSION [10-5] 


Another application of the stochastic integral is to the Karhunen—Loéve expansion. The 
idea is to decompose a general second-order random process into an orthonormal expan- 
sion whose coefficients are uncorrelated random variables. The expansion functions are just 
deterministic functions that have been orthonormalized to serve as a basis set for this decom- 
position. The Karhunen—Loéve (K-—L) expansion has proved to be a very useful theoretical 
tool in such diverse areas as detection theory, pattern recognition, and image coding. It is 
often used as an intermediate step in deriving general results in these areas. We will present 
an example to show how the K—L expansion is used in optimal detection of a known signal 
in white Gaussian noise (Example 10.5-4). 


Theorem 10.5-1 Let $X(t)$ be a zero-mean, second-order random process defined over
$[-T/2,+T/2]$ with continuous covariance function $K_{XX}(t_1,t_2) = R_{XX}(t_1,t_2)$, because of
zero-mean. Then we can write

$$ X(t) = \sum_{n=1}^{\infty} X_n \phi_n(t) \quad \text{(m.s.) for } |t| \le T/2 \tag{10.5-1} $$

with

$$ X_n \triangleq \int_{-T/2}^{T/2} X(t)\phi_n^*(t)\,dt, \tag{10.5-2} $$

where the set of functions $\{\phi_n(t)\}$ is a complete orthonormal set of solutions to the integral
equation

$$ \int_{-T/2}^{T/2} K_{XX}(t_1,t_2)\phi_n(t_2)\,dt_2 = \lambda_n\phi_n(t_1), \quad |t_1| \le T/2, \tag{10.5-3} $$

and the coefficients $X_n$ are statistically† orthogonal; that is,

$$ E[X_n X_m^*] = \lambda_n\delta_{mn}, \tag{10.5-4} $$

with $\delta_{mn}$ the Kronecker delta function.


The functions $\phi_n(t)$ are orthonormal over $(-T/2, T/2)$ in the sense that

$$ \int_{-T/2}^{T/2} \phi_n(t)\phi_m^*(t)\,dt = \delta_{mn}, \tag{10.5-5} $$

where $\delta_{mn}$ is the Kronecker delta. In fact it is easy to show that any two normalized solutions
$\phi_n(t)$ and $\phi_m(t)$ to the integral Equation 10.5-3 must be orthonormal if $\lambda_n \ne \lambda_m$ and
both $\lambda$'s are not zero. See Problem 10.27. Note that we could just as well have used the
correlation function $R_{XX}(t_1,t_2)$ here in the K-L theorem because the mean of the random
process $X(t)$ is assumed zero.

The interesting thing about this expansion is that the coefficients are uncorrelated
or statistically orthogonal. Otherwise any expansion such as the Fourier series expansion
would suffice.‡ The point of this theorem is that there exists at least one set of orthonormal
functions with the special property that the coefficients in its expansion are uncorrelated
random variables. We break up the proof of this important theorem into two steps or
lemmas. We also need a result from the theory of integral equations known as Mercer's
theorem, which states that

$$ K_{XX}(t_1,t_2) = \sum_{n=1}^{\infty} \lambda_n\phi_n(t_1)\phi_n^*(t_2). \tag{10.5-6} $$

This result is derived in the appendix at the end of this chapter. Also, a constructive method
is shown in Facts 1 to 3 of this appendix to find the $\{\phi_n(t)\}$, the set of orthonormal solutions
to Equation 10.5-3.


Lemma 1 If $X(t) = \sum_{n=1}^{\infty} X_n\phi_n(t)$ (m.s.) and the $X_n$'s are statistically orthogonal,
then the $\phi_n(t)$ must satisfy integral Equation 10.5-3.


Proof We compute $X(t)X_n^* = \sum_m X_m X_n^*\phi_m(t)$, thus

$$ E[X(t)X_n^*] = \sum_m E[X_m X_n^*]\phi_m(t) = E[|X_n|^2]\phi_n(t); $$

†We say statistically orthogonal coefficients $X_n$ to avoid confusion with the deterministic orthogonality
of the basis functions $\phi_n(t)$.

‡Incidentally, it can be shown that the m.s. Fourier series coefficients of $X(t)$ become asymptotically
uncorrelated as $T \to \infty$ and further that the K-L eigenfunctions approach complex exponentials as $T \to \infty$
[10-6]. We will establish this result at the end of this chapter.
but also

$$ E[X(t)X_n^*] = E\left[X(t)\int_{-T/2}^{T/2} X^*(s)\phi_n(s)\,ds\right] = \int_{-T/2}^{T/2} K_{XX}(t,s)\phi_n(s)\,ds, $$

hence

$$ \int_{-T/2}^{T/2} K_{XX}(t,s)\phi_n(s)\,ds = \lambda_n\phi_n(t) \quad \text{with } \lambda_n \triangleq E[|X_n|^2]. \qquad \blacksquare $$


We next show a partial converse to this lemma. 


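Lemma 2 If the orthonormal set $\{\phi_n(t)\}$ satisfies integral Equation 10.5-3, then
the random-variable coefficients $X_n$ given in Equation 10.5-2 must be statistically orthogonal.


Proof By Equation 10.5-2

$$ X_m^* = \int_{-T/2}^{T/2} X^*(s)\phi_m(s)\,ds, $$

so

$$ E[X(t)X_m^*] = \int_{-T/2}^{T/2} K_{XX}(t,s)\phi_m(s)\,ds = \lambda_m\phi_m(t), $$

because the $\phi_n$'s satisfy Equation 10.5-3. Thus,

$$ E[X_n X_m^*] = E\left[\int_{-T/2}^{T/2} X(t)\phi_n^*(t)\,dt\; X_m^*\right] = \lambda_m\int_{-T/2}^{T/2}\phi_m(t)\phi_n^*(t)\,dt = \lambda_m\delta_{mn}. \qquad \blacksquare $$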


By combining the results in the above two lemmas, we can see that the K—L coefficients 
will be statistically orthogonal if and only if the orthonormal basis functions are chosen as 
solutions to Equation 10.5-3. What remains to show is that the K—L expansion does in fact 
produce a mean-square equality, which we show below with the help of Mercer’s theorem. 


Proof of Theorem 10.5-1 Define $\hat{X}(t) \triangleq \sum_{n=1}^{\infty} X_n\phi_n(t)$ and consider

$$ E[|X(t) - \hat{X}(t)|^2] = E[X(t)(X(t) - \hat{X}(t))^*] - E[\hat{X}(t)(X(t) - \hat{X}(t))^*]. $$

The second term is zero since

$$ E[\hat{X}(t)(X(t) - \hat{X}(t))^*] = E\left[\sum_m X_m X^*(t)\phi_m(t) - \sum_m\sum_n X_m X_n^*\phi_m(t)\phi_n^*(t)\right] $$

$$ = \sum_m E[X_m X^*(t)]\phi_m(t) - \sum_m\sum_n E[X_m X_n^*]\phi_m(t)\phi_n^*(t) $$

$$ = \sum_m \lambda_m|\phi_m(t)|^2 - \sum_m \lambda_m|\phi_m(t)|^2 = 0 $$

by the first step in the proof of Lemma 2, so evaluating the first term, we get

$$ E[|X(t) - \hat{X}(t)|^2] = K_{XX}(t,t) - \sum_{n=1}^{\infty} \lambda_n\phi_n(t)\phi_n^*(t), $$

which is zero by Equation 10.5-6, Mercer's theorem. ■
We now present some examples of the calculation of the K-L expansion.


Example 10.5-1 
(K-L expansion of white noise) Let $K_{WW}(t_1,t_2) = \sigma^2\delta(t_1 - t_2)$. Then we must find the
$\phi_n$'s to satisfy the K-L integral equation,

$$ \sigma^2\int_{-T/2}^{T/2} \delta(t_1 - t_2)\phi(t_2)\,dt_2 = \lambda\phi(t_1), \quad -T/2 \le t_1 \le +T/2, $$

or

$$ \sigma^2\phi(t_1) = \lambda\phi(t_1). $$

Thus, in this case the $\phi(t)$ functions are arbitrary and all the $\lambda_n$'s equal $\sigma^2$. So the expansion
functions $\phi_n(t)$ can be any complete orthonormal set with corresponding eigenvalues taken
to be $\lambda_n = \sigma^2$.


Note that this example, though easy, violates the second-order constraint on the covari- 
ance function in the K-L Theorem 10.5-1. Nevertheless, the resulting expansion can be 
shown to be valid in the sense of generalized random processes (cf. definition of white 
Gaussian noise in Section 10.1). 


Example 10.5-2 
(random process plus white noise) Here we look at what happens if we add a random process
to the white noise of the previous example and then want to know the K-L expansion for
the noisy process:

$$ Y(t) = X(t) + W(t), $$

where $W(t)$ is white noise and $X$ and $W$ are orthogonal (Definition 9.4-1), which we denote
as $X \perp W$. Plugging into the K-L equation as before, we obtain

$$ \int_{-T/2}^{T/2} [K_{XX}(t,s) + \sigma^2\delta(t-s)]\phi^{(y)}(s)\,ds = \lambda^{(y)}\phi^{(y)}(t) = \int_{-T/2}^{T/2} K_{XX}(t,s)\phi^{(y)}(s)\,ds + \sigma^2\phi^{(y)}(t), $$

so

$$ \int_{-T/2}^{T/2} K_{XX}(t,s)\phi^{(y)}(s)\,ds = (\lambda^{(y)} - \sigma^2)\phi^{(y)}(t), $$

where we use the superscripts $x$ and $y$ to denote the respective eigenvalues and eigenfunctions
of $X(t)$ and $Y(t)$. But since

$$ \int_{-T/2}^{T/2} K_{XX}(t,s)\phi^{(x)}(s)\,ds = \lambda^{(x)}\phi^{(x)}(t), $$

we immediately obtain

$$ \phi^{(y)}(t) = \phi^{(x)}(t) $$

and

$$ \lambda^{(y)} - \sigma^2 = \lambda^{(x)}. $$

We see that the eigenfunctions, that is, the K-L basis functions, are the same for both the $X$
and $Y$ processes. The K-L coefficients $Y_n = (X_n + W_n)$ then have variances $\lambda_n^{(y)} = \lambda_n^{(x)} + \sigma^2$.


Example 10.5-3 
(K-L expansion for Wiener process) For this example let the time interval be $(0, T)$ to match
our definition of the Wiener process. Using Equation 9.2-18 in the K-L integral Equation
10.5-3, we obtain

$$ \sigma^2\int_0^T \min(t,s)\phi(s)\,ds = \lambda\phi(t), \quad 0 < t < T, \tag{10.5-7} $$

or

$$ \sigma^2\left[\int_0^t s\,\phi(s)\,ds + t\int_t^T \phi(s)\,ds\right] = \lambda\phi(t). $$

We temporarily agree to set $\sigma^2 = 1$ to simplify the equations. The standard method of
solution of Equation 10.5-7 is to differentiate it as many times as necessary to convert it
to a differential equation. We then evaluate the boundary conditions needed to solve the
differential equation by plugging the general solution back into the integral equation. Here
we take one derivative with respect to $t$ and obtain

$$ \int_t^T \phi(s)\,ds = \lambda\dot{\phi}(t). \tag{10.5-8} $$

Taking a second derivative, we obtain a differential equation,

$$ -\phi(t) = \lambda\ddot{\phi}(t), $$

with general solution,

$$ \phi(t) = A\sin(t/\sqrt{\lambda}) + B\cos(t/\sqrt{\lambda}). $$

Next we use the boundary conditions at 0 and $T$ to determine the coefficients $A$, $B$,
and $\lambda$. From Equation 10.5-7 at $t = 0+$ we get $\phi(0+) = 0$, which implies $B = 0$. From
Equation 10.5-8 at $t = T-$ we get $\dot{\phi}(T-) = 0$, which implies that

$$ \cos(T/\sqrt{\lambda}) = 0, \quad \text{that is,} \quad \lambda = \lambda_n = \left(\frac{T}{(n - \frac{1}{2})\pi}\right)^2 \quad \text{for } n \ge 1. $$

Finally, $A$ is chosen such that $\phi$ is normalized, that is,

$$ \int_0^T \phi_n^2(t)\,dt = 1, \quad \text{which implies } A = \sqrt{2/T}. $$

Thus, we get the following solution for the K-L basis functions,

$$ \phi_n(t) = \sqrt{2/T}\,\sin\left[\left(n - \tfrac{1}{2}\right)\pi t/T\right], \quad n \ge 1. $$

Now by Problem 10.27, the $\phi_n(t)$ must satisfy $\phi_n \perp \phi_m$ for $n \ne m$ since the eigenvalues
$\lambda_n = [T/((n - \frac{1}{2})\pi)]^2$ are distinct. This is the K-L expansion for the standard Wiener process,
that is, one with $\sigma = 1$. If $\sigma \ne 1$, then $\lambda_n$ is replaced with $\sigma^2\lambda_n$.
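The eigenpairs found in this example can be checked numerically. The sketch below evaluates the left side of the K-L integral equation with a midpoint rule (taking $\sigma^2 = 1$) and compares it with $\lambda_n\phi_n(t)$; it also checks orthonormality. The interval length `T_END`, the quadrature helper, and the test points are illustrative assumptions.

```python
import math

T_END = 2.0  # interval (0, T); hypothetical value

def phi(n, t):
    """K-L basis function sqrt(2/T) * sin((n - 1/2) * pi * t / T)."""
    return math.sqrt(2.0 / T_END) * math.sin((n - 0.5) * math.pi * t / T_END)

def lam(n):
    """Eigenvalue (T / ((n - 1/2) * pi))^2 for sigma^2 = 1."""
    return (T_END / ((n - 0.5) * math.pi)) ** 2

def integrate(g, lower, upper, steps=4000):
    """Midpoint-rule integral of g over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(g(lower + (k + 0.5) * h) for k in range(steps))

def kl_residual(n, t):
    """Integral of min(t, s) * phi_n(s) over (0, T), minus lambda_n * phi_n(t)."""
    lhs = integrate(lambda s: min(t, s) * phi(n, s), 0.0, T_END)
    return lhs - lam(n) * phi(n, t)

def inner(n, m):
    """Inner product of phi_n and phi_m over (0, T)."""
    return integrate(lambda s: phi(n, s) * phi(m, s), 0.0, T_END)

print(kl_residual(1, 0.7), kl_residual(3, 1.3), inner(1, 1), inner(1, 2))
```

The residuals are at the level of the quadrature error, and the inner products are close to 1 and 0, consistent with Equations 10.5-3 and 10.5-5.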


We now present an application of the Karhunen—Loéve expansion to a simple problem 
from the area of communication theory known as detection theory. 


Example 10.5-4 
(application to signal detection) Assume we observe a waveform $X(t)$ over the interval
$[-T/2,+T/2]$ and wish to decide whether it contains a signal buried in noise or just noise
alone. To be more precise we define two hypotheses, $H_1$ and $H_0$, and consider the decision
theory problem:

$$ X(t) = \begin{cases} m(t) + W(t) : & H_1, \\ W(t) : & H_0, \end{cases} $$

where $m(t)$ is a deterministic function, that is, the signal, and $W(t)$ is the noise modeled by a
zero-mean, white Gaussian process. Note that this is the kind of hypothesis testing problem
discussed in Chapter 7. Using the K-L expansion, we can simplify the preceding decision
problem by replacing this waveform problem by a sequence of simpler scalar problems,

$$ X_n = \begin{cases} m_n + W_n : & H_1, \\ W_n : & H_0, \end{cases} $$

where $m_n$ and $W_n$ are the respective K-L coefficients.

Effectively we take the K-L transform of the original received signal $X(t)$. The transform
space is then just the space of sequences of K-L coefficients. Using the fact that
the noise is zero-mean Gaussian and that the expansion coefficients are orthogonal, we
conclude that the random variables $W_n$ are jointly independent, that is, $W_n$ is an independent
random sequence. The problem can be simplified even further by observing that
$K_{WW}(t_1,t_2) = \sigma^2\delta(t_1 - t_2)$ permits the $\phi_n(t)$'s to be any complete set of orthonormal
solutions to the K-L integral Equation 10.5-3. It is convenient to take $\phi_1(t) = cm(t)$ where
$c$ is the normalizing constant

$$ c = \left[\int_{-T/2}^{T/2} m^2(t)\,dt\right]^{-1/2}, $$

and then complete the orthonormal set in any valid way. We then notice that all the $m_k$ will
be zero except for $k = 1$; thus, only $X_1$ is affected by the presence or absence of the signal.
One can then show that this detection problem can finally be reduced to the scalar problem,

$$ X_1 = \begin{cases} m_1 + W_1 : & H_1, \\ W_1 : & H_0. \end{cases} $$

To compute $X_1$ we note that it is just the stochastic integral

$$ X_1 = c\int_{-T/2}^{T/2} X(t)m(t)\,dt $$

that is often referred to as a matching operation. In fact, it can be performed by sampling
the output of a filter whose impulse response is $h(t) = cm(T - t)$, where $T$ is chosen large
enough to make this impulse response causal. The filter output at time $T$ is then $X_1$. This
filter is called a matched filter and is widely used in communications and pattern recognition.
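The matching operation can be sketched numerically for noise-free inputs. The code below computes the normalizing constant $c$ and the statistic $X_1$ with a midpoint rule. The pulse shape used for $m(t)$, the interval length, and the helper names are hypothetical choices for illustration only. With the signal alone as input, $X_1$ equals $m_1 = 1/c$, the square root of the signal energy; a waveform orthogonal to $m(t)$ gives $X_1 = 0$.

```python
import math

T_OBS = 2.0  # observation interval is [-T/2, +T/2]; hypothetical value

def integrate(g, steps=2000):
    """Midpoint-rule integral of g over [-T_OBS/2, +T_OBS/2]."""
    h = T_OBS / steps
    return h * sum(g(-T_OBS / 2.0 + (k + 0.5) * h) for k in range(steps))

def signal(t):
    """Hypothetical known signal m(t); any finite-energy pulse would do."""
    return math.exp(-3.0 * abs(t))

# Normalizing constant c = (integral of m^2)^(-1/2), so phi_1 = c * m has unit energy.
c = integrate(lambda t: signal(t) ** 2) ** -0.5

def matched_statistic(x):
    """X_1 = c * integral of x(t) m(t) dt, the first K-L coefficient of x."""
    return c * integrate(lambda t: x(t) * signal(t))

x1_signal_only = matched_statistic(signal)      # noise-free H1 waveform
x1_orthogonal = matched_statistic(lambda t: t)  # odd waveform, orthogonal to the even m(t)
print(x1_signal_only, 1.0 / c, x1_orthogonal)
```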


Another application of the K—L expansion is in the derivation of important results in 
linear estimation theory. Analogously to the preceding example, the approach is to reduce 
the waveform estimation problem to the simpler one of estimating the individual K—L 
coefficients (cf. Problems 10.29 and 10.30). 


10.6 REPRESENTATION OF BANDLIMITED AND PERIODIC PROCESSES 


Here we consider expansions of random processes in terms of sets of random variables. An 
example that we have already seen would be the Karhunen—Loéve expansion of Section 10.5. 
In general, the sets of random variables will contain an infinite number of elements; thus, we 
are equivalently representing a random process by a random sequence. This representation is 
essential for digital processing of waveforms. Also, when the coefficients in the representation 
or expansion are uncorrelated or independent, then important additional simplifications 
result. We start out by considering WSS processes whose psd’s have finite support; that 
is, the respective correlation functions are bandlimited. We then develop an m.s. sampling 
theorem. 


Bandlimited Processes 


Definition 10.6-1 A random process $X(t)$ that is WSS is said to be bandlimited to
$[\omega_1, \omega_2]$ if $S_{XX}(\omega) = 0$ for $|\omega| \notin [\omega_1,\omega_2]$. When $\omega_1 = 0$, we say the process is lowpass, and
we set $\omega_2 = \omega_c$, called the cutoff frequency.

In the case of a lowpass random process $X(t)$ we can use the ordinary sampling theorem
for deterministic signals [10-4] to write the following representation for the correlation function
$R_{XX}(\tau)$ of the lowpass function in terms of the infinite set of samples $R_{XX}(nT)$ taken
at spacing $T = \pi/\omega_c$:

$$ R_{XX}(\tau) = \sum_{n=-\infty}^{+\infty} R_{XX}(nT)\,\frac{\sin\omega_c(\tau - nT)}{\omega_c(\tau - nT)}. \tag{10.6-1} $$

It turns out that one can define a mean-square sampling theorem for WSS random processes,
which we next state and prove.
Theorem 10.6-1 If a second-order WSS random process $X(t)$ is lowpass with cutoff
frequency $\omega_c$, then upon setting $T \triangleq \pi/\omega_c$,

$$ X(t) = \sum_{n=-\infty}^{+\infty} X(nT)\,\frac{\sin\omega_c(t - nT)}{\omega_c(t - nT)}. $$

We point out that the foregoing equality is in the sense of an m.s. limit, that is, with

$$ X_N(t) \triangleq \sum_{n=-N}^{+N} X(nT)\,\frac{\sin\omega_c(t - nT)}{\omega_c(t - nT)}, $$

then $\lim_{N\to\infty} E[|X(t) - X_N(t)|^2] = 0$ for each $t$.

Proof First we observe that

$$ E[|X(t) - X_N(t)|^2] = E[(X(t) - X_N(t))X^*(t)] - E[(X(t) - X_N(t))X_N^*(t)]. \tag{10.6-2} $$

Since $X_N^*(t)$ is just a weighted sum of the $X^*(mT)$, we begin by obtaining the preliminary
result that $E[(X(t) - X_N(t))X^*(mT)] \to 0$:

$$ E\left[\left(X(t) - \sum_{n=-N}^{+N} X(nT)\,\frac{\sin\omega_c(t - nT)}{\omega_c(t - nT)}\right)X^*(mT)\right] = R_{XX}(t - mT) - \sum_{n=-N}^{+N} R_{XX}(nT - mT)\,\frac{\sin\omega_c(t - nT)}{\omega_c(t - nT)} $$

$$ \to R_{XX}(t - mT) - R_{XX}(t - mT) = 0 \quad \text{as } N \to \infty, $$

where the last equality follows by replacing $\tau$ with $t - mT$ in $R_{XX}(\tau)$ and writing the
sampling expansion for this bandlimited function of $t$. Setting $\hat{X}(t) \triangleq \lim X_N(t)$ in the m.s.
sense, we get

$$ E[(X(t) - \hat{X}(t))X^*(mT)] = 0,^{*} $$


*We have used the fact that $X(t)$ is second order, that is, $E[|X(t)|^2] < \infty$, so that Theorem 10.1-5
applies and allows us to interchange the m.s. limit with the expectation operator.
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that is, the error X(t) — x (t) is orthogonal to X(mT) for all m. We write this symbolically 
as 
(X(t) — X(t)) L X(mT), Vm. 


But then we also have that X(t) — X(t) 1 Xy(t) because X y(t) is just a weighted sum of 
X(mT). Then letting N — +00, we get 


(X()-XW)LXH, Ve, 


which just means E[(X(t) — XA} = 0; thus, the second term in Equation 10.6-2 is 
asymptotically zero. Considering the first term in Equation 10.6-2 we get 


sinw-(t — nT’) 
we(t—nT) ” 


E((X(t) — Xw(t)) X*(@)] = Rxx(0)- SD) Rxx(nT- 1) 


which tends to zero as N — +00 by virtue of the representation 


sinw,(t — nT’) 
we(t — nT) 


Rxx(0) = s Rxx(nT — t) 


n=—Co 


obtained by right-shifting the bandlimited Rx x (rT) in Equation 10.6-1 by the shift t, thereby 
obtaining 
sinw.(t — nT’) 

w(t — nT) 


Rxx( 7—t) _> Rxx( nT t) 


n=— Co 


and then setting the free parameter r = t. Thus, E [|X(t)—Xw/(t)|?] = 0 as 
N—-+oo. 


In words, the m.s. sampling theorem tells us that knowledge of the sampled sequence 
is sufficient for determining the random process at time t up to an event of probability 
zero since two random variables equal in the m.s. sense are equal with probability-1. (See 
Theorem 10.1-5.) 
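
As a numerical illustration of Theorem 10.6-1, the following sketch evaluates the truncation error $E[|X(t) - X_N(t)|^2]$ directly from a correlation function. It assumes a real-valued process with a triangular psd on $(-\omega_c, +\omega_c)$, so that $R_{XX}(\tau) = [\sin(\omega_c\tau/2)/(\omega_c\tau/2)]^2$; this correlation function, the names `corr` and `ms_error`, and the evaluation point t = 0.3T are illustrative assumptions and do not come from the text.

```python
import numpy as np

T = 1.0  # sample spacing; the cutoff frequency is then w_c = pi / T

def corr(tau):
    """Assumed R_XX(tau) for a triangular psd on (-w_c, w_c), with R_XX(0) = 1."""
    # np.sinc(x) = sin(pi x) / (pi x), so this is [sin(w_c tau / 2) / (w_c tau / 2)]^2.
    return np.sinc(np.asarray(tau) / (2.0 * T)) ** 2

def ms_error(t, N):
    """E|X(t) - X_N(t)|^2 for the sampling expansion truncated to |n| <= N."""
    n = np.arange(-N, N + 1)
    s = np.sinc((t - n * T) / T)               # sin w_c(t - nT) / (w_c (t - nT))
    r = corr(t - n * T)                        # E[X(t) X(nT)]
    K = corr((n[:, None] - n[None, :]) * T)    # E[X(nT) X(mT)]
    return corr(0.0) - 2.0 * (r @ s) + s @ K @ s

errors = {N: ms_error(0.3 * T, N) for N in (4, 40, 400)}
for N, err in errors.items():
    print(f"N = {N:4d}   m.s. error = {float(err):.3e}")
```

The error is expanded as $R_{XX}(0) - 2\sum_n R_{XX}(t - nT)s_n + \sum_n\sum_m R_{XX}((n-m)T)s_ns_m$, which is valid for a real-valued process; the printed values shrink toward zero as N grows, in line with the theorem.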

To consider digital processing of the resulting random sequence, we change the notation
slightly by writing $X_a$ for the random process and use X to denote the corresponding random
sequence. Then the mean of the random sequence $X[n] \triangleq X_a(nT)$ is $\mu_X = E[X_a(t)]$, and
the correlation function is $R_{XX}[m] = R_{X_aX_a}(mT)$. This then gives the following psd for
the random sequence:

$$S_{XX}(\omega) = \frac{1}{T}\,S_{X_aX_a}\!\left(\frac{\omega}{T}\right), \quad |\omega| \le \pi,$$

if $X_a(t)$ is lowpass with cutoff $\omega_c = \pi/T$. After digital processing we can restore the
continuous-time random process using the m.s. sampling expansion. Note that we assume
perfect sampling and reconstruction. Here the WSS random process becomes a stationary
random sequence and the reconstructed process is again WSS. If sample-and-hold type
suboptimal reconstruction is used (as in Problem 9.1), then the reconstructed random
process will not even be WSS.

We see that the coefficients in this expansion, that is, the samples of $X_a(t)$ at spacing T,
most often will be correlated. However, there is one case where they will be uncorrelated.
That case is when the process has a flat psd over a region of finite support and is lowpass.


Example 10.6-1 
(lowpass noise with flat psd) Let $X_a(t)$ be WSS and bandlimited to $(-\omega_c, +\omega_c)$ with flat
psd

$$S_{X_aX_a}(\omega) = S_{X_aX_a}(0)\,I_{(-\omega_c, +\omega_c)}(\omega)$$

as seen in Figure 10.6-1. Then $R_{X_aX_a}(\tau)$ is given as

$$R_{X_aX_a}(\tau) = R_{X_aX_a}(0)\,\frac{\sin \omega_c\tau}{\omega_c\tau} \quad \text{with} \quad R_{X_aX_a}(0) = \frac{\omega_c}{\pi}\,S_{X_aX_a}(0).$$

Since $R_{X_aX_a}(mT) = 0$ for $m \neq 0$ we see that the samples are orthogonal, all with the same
average power $R_{X_aX_a}(0)$. Thus, the random sequence X[n] is a white noise and its psd is
flat:

$$S_{XX}(\omega) = \frac{1}{T}\,S_{X_aX_a}(0), \quad |\omega| \le \pi,$$

with correlation function a discrete-time impulse

$$R_{XX}[m] = R_{X_aX_a}(0)\,\delta[m].$$
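
The claim of Example 10.6-1 can be checked in a few lines. The sketch below samples the sinc-shaped correlation function at spacing $T = \pi/\omega_c$ and confirms that only the m = 0 sample is nonzero. The numerical values of T and of the psd level `S0` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

T = 0.5                    # assumed sample spacing, so w_c = pi / T
w_c = np.pi / T
S0 = 2.0                   # assumed flat psd level S_XaXa(0)
R0 = w_c * S0 / np.pi      # R_XaXa(0) = (w_c / pi) S_XaXa(0)

def corr_a(tau):
    """R_XaXa(tau) = R_XaXa(0) sin(w_c tau) / (w_c tau)."""
    # np.sinc(x) = sin(pi x) / (pi x), hence the division by pi.
    return R0 * np.sinc(w_c * np.asarray(tau) / np.pi)

m = np.arange(-5, 6)
R_seq = corr_a(m * T)      # R_XX[m] = R_XaXa(mT)
print(np.round(R_seq, 12))
```

Up to rounding, the printed sequence is $R_{X_aX_a}(0)\,\delta[m]$, and $R_{X_aX_a}(0) = S_{X_aX_a}(0)/T$ agrees with integrating the flat discrete-time psd over $(-\pi, +\pi)$.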


Bandpass Random Processes 


Next we consider the treatment of bandpass random processes. Such processes are used 
as models for random signals or noise that have been modulated by a carrier wave to 
enable long-distance transmission. Also, sometimes the frequency selective nature of the 
transmission medium converts a wideband signal into an approximately narrowband signal. 
Common examples are radio waves in the atmosphere and pressure waves in the ground or 
underwater. 


Figure 10.6-1 The psd of a bandlimited white random process.

First we show that we can construct a WSS bandpass random process U(t) using two
lowpass WSS processes. Thus, consider a real-valued bandpass random process whose psd,
for positive frequencies, is centered at $\omega_0$:

$$U(t) = X(t)\cos(\omega_0 t) + Y(t)\sin(\omega_0 t), \tag{10.6-3}$$

where the lowpass processes X and Y are real-valued, have zero-means, are WSS, and satisfy

$$K_{XX}(\tau) = K_{YY}(\tau), \tag{10.6-4}$$

$$K_{XY}(\tau) = -K_{YX}(\tau). \tag{10.6-5}$$


In a representation such as Equations 10.6-3 through 10.6-5, X is called the in-phase compo- 
nent of U, and Y is called its quadrature component. 

The symmetry conditions of Equations 10.6-4 and 10.6-5, while heavily constraining the
component lowpass processes {X(t), Y(t)}, are sufficient to guarantee that the process U(t) of
Equation 10.6-3 is WSS (cf. Problem 10.37). We show below that this model is general enough
to describe an arbitrary WSS bandpass noise process. In particular, we will see that the
cross-covariance terms $K_{XY}(\tau)$ model the part of a general psd $S_{UU}(\omega)$ which is odd or
nonsymmetrical about the center frequency $\omega_0$, while the covariance terms $K_{XX}(\tau)$ model
the part that is even with respect to the center frequency.

From Equations 10.6-4 and 10.6-5, it follows that

$$K_{UU}(\tau) = K_{XX}(\tau)\cos(\omega_0\tau) + K_{YX}(\tau)\sin(\omega_0\tau)$$

or, as in the frequency domain,

$$S_{UU}(\omega) = \frac{1}{2}\left[S_{XX}(\omega - \omega_0) + S_{XX}(\omega + \omega_0)\right] + \frac{1}{2j}\left[S_{YX}(\omega - \omega_0) - S_{YX}(\omega + \omega_0)\right].$$

Now $K_{XY}(\tau) = -K_{YX}(\tau)$. Also, $K_{YX}(\tau) = K_{XY}^*(-\tau) = K_{XY}(-\tau)$ since X and Y are
real-valued, so

$$S_{YX}(\omega) = S_{XY}^*(\omega) = S_{XY}(-\omega) = -S_{XY}(+\omega).$$

From $S_{XY}^*(\omega) = -S_{XY}(\omega)$ we get that $S_{XY}$ is pure imaginary. From $S_{XY}(-\omega) = -S_{XY}(\omega)$
we get that $S_{XY}$ is an odd function of $\omega$. The same holds for $S_{YX}$ since $K_{XY}(\tau) =
K_{YX}^*(-\tau) = K_{YX}(-\tau)$; thus, $\frac{1}{2}S_{XX}(\omega - \omega_0)$ is the in-phase or even part of the psd at
$\omega_0$, and $(1/2j)S_{YX}(\omega - \omega_0)$ is the quadrature or odd part. These properties are illustrated
in Figure 10.6-2.

Thus we can conclude that Equation 10.6-3 together with the symmetry conditions 
of Equations 10.6-4 and 10.6-5 on the component lowpass processes X(t) and Y(t) are 
sufficiently general to model an arbitrary WSS bandpass process U(t). 

To find the lowpass components, we could decompose the process U(t) as shown in 
Figure 10.6-3; however, the random processes obtained after multiplication by cos and sin 
are not WSS so that the system of the figure may not be analyzed in the frequency domain. 
An alternative approach is through the Hilbert transform operator, defined as filtering with 
the system function 


$$H(\omega) = -j\,\mathrm{sgn}(\omega).$$

Figure 10.6-3 Decomposition of bandpass random process. (Block labels recoverable from the figure: multipliers $2\cos\omega_0 t$ and $2\sin\omega_0 t$, ideal LPF, output Y(t).)


Using this operator we can define an analytic signal process Z(t),

$$Z(t) \triangleq U(t) + j\tilde{U}(t), \tag{10.6-6}$$

where the superscript ~ indicates a Hilbert transformation of U(t). Then it turns out that
we can take

$$X(t) = \mathrm{Re}\left[Z(t)\,e^{-j\omega_0 t}\right]$$

and

$$Y(t) = -\mathrm{Im}\left[Z(t)\,e^{-j\omega_0 t}\right],$$


to achieve the desired representation (Equation 10.6-3). These X and Y are actually the 
same as in Figure 10.6-3. The psd of Z(t) is 


Szz(w) = 4Suu(w) u(w), (10.6-7) 


where $u(\omega)$ is the unit-step function. This psd is sketched in Figure 10.6-4, where we note
that its support is restricted to positive $\omega$.

Figure 10.6-4 psd of analytic signal random process.
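
To make the analytic-signal construction concrete, the sketch below builds one bandpass sample path from two lowpass sample paths as in Equation 10.6-3, forms a discrete analytic signal, and recovers the in-phase and quadrature components from $Z(t)e^{-j\omega_0 t}$. The FFT-based one-sided spectrum used here is a discrete stand-in for the Hilbert transform operator, and the signal length, carrier bin, and helper names (`analytic_signal`, `lowpass_path`) are assumptions made only for this illustration.

```python
import numpy as np

def analytic_signal(u):
    """Discrete analytic signal: keep dc, double positive frequencies, zero negative ones."""
    n = u.size
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.fft.ifft(np.fft.fft(u) * gain)

rng = np.random.default_rng(0)
n, carrier_bin, max_bin = 1024, 100, 5
t = np.arange(n) / n                      # one record of unit length
w0 = 2.0 * np.pi * carrier_bin            # carrier frequency, exactly on an FFT bin

def lowpass_path():
    """One sample path made of a few low-frequency sinusoids (bins 1..max_bin)."""
    k = np.arange(1, max_bin + 1)[:, None]
    amp = rng.normal(size=(max_bin, 1))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(max_bin, 1))
    return (amp * np.cos(2.0 * np.pi * k * t[None, :] + phase)).sum(axis=0)

x, y = lowpass_path(), lowpass_path()
u = x * np.cos(w0 * t) + y * np.sin(w0 * t)      # Equation 10.6-3

baseband = analytic_signal(u) * np.exp(-1j * w0 * t)
x_rec, y_rec = baseband.real, -baseband.imag      # X = Re[.], Y = -Im[.]
print(np.max(np.abs(x_rec - x)), np.max(np.abs(y_rec - y)))
```

Because every sinusoid sits exactly on an FFT bin, the recovery is exact up to rounding; for general sample paths the discrete construction is only an approximation of the ideal Hilbert transformer.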


One can use this theory for computer simulation of bandpass processes by first decomposing
the process into its X and Y lowpass components, or equivalently the complex
lowpass process $X(t) - jY(t)$, and then representing these lowpass processes by their sampled
equivalents. Thus, one can simulate bandpass processes and bandpass systems by discrete-time
processing of coupled pairs of random sequences, that is, vector random sequences of
dimension two, at a greatly reduced sample rate because X and Y are lowpass.


WSS Periodic Processes 


A WSS random process may have a correlation function that is periodic; that is, $R(\tau) =
R(\tau + T)$ for all $\tau$. This is a special case of the general periodic process introduced in
Chapter 9.


Definition 10.6-2 A WSS random process X(t) is mean-square periodic if for some
T we have $R_{XX}(\tau) = R_{XX}(\tau + T)$ for all $\tau$. We call the smallest such T > 0 the period.


In Chapter 9, we called such processes wide-sense periodic. In Problem 10.38 the reader
is asked to show that an m.s. periodic process is also periodic in the stronger sense:

$$E[|X(t) - X(t+T)|^2] = 0.$$

We now show this directly for the WSS periodic case. Evaluating we get

$$E[|X(t) - X(t+T)|^2] = 2\left(R_{XX}(0) - \mathrm{Re}[R_{XX}(T)]\right).$$

Now since $R_{XX}(0) = R_{XX}(T)$, it follows that $R_{XX}(T)$ is real and $\mathrm{Re}\{R_{XX}(T)\} = R_{XX}(T)
= R_{XX}(0)$. Hence X(t) = X(t+T) (m.s.) and hence also with probability-1.
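
The identity $E[|X(t) - X(t+T)|^2] = 2(R_{XX}(0) - \mathrm{Re}[R_{XX}(T)])$ is easy to evaluate for a concrete case. The sketch below assumes the random-phase sinusoid $X(t) = \cos(\omega_0 t + \Theta)$ with $\Theta$ uniform on $[0, 2\pi)$, whose correlation function is $\frac{1}{2}\cos(\omega_0\tau)$; this process and the function names are illustrative assumptions.

```python
import numpy as np

w0 = 2.0 * np.pi              # assumed fundamental frequency
T = 2.0 * np.pi / w0          # corresponding period (here T = 1)

def corr(tau):
    """Assumed R_XX(tau) = 0.5 cos(w0 tau) for X(t) = cos(w0 t + Theta)."""
    return 0.5 * np.cos(w0 * tau)

def ms_shift_error(shift):
    """E|X(t) - X(t + shift)|^2 = 2 (R_XX(0) - Re R_XX(shift)) for a WSS process."""
    return 2.0 * (corr(0.0) - np.real(corr(shift)))

print(ms_shift_error(T))         # shift by one full period
print(ms_shift_error(T / 2.0))   # shift by half a period
```

A shift by the full period gives a mean-square difference of zero, while a half-period shift gives the largest possible value, $4R_{XX}(0)$.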

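Turning to the psd of a WSS periodic process, we know that $R_{XX}(\tau)$ has a Fourier
series

$$R_{XX}(\tau) = \sum_{n=-\infty}^{+\infty} \alpha_n\,e^{+j\omega_0 n\tau}, \quad \omega_0 \triangleq 2\pi/T,$$

with coefficients

$$\alpha_n = \frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(\tau)\,e^{-j\omega_0 n\tau}\,d\tau.$$

Thus, the psd of a WSS periodic process is a line spectrum of the form

$$S_{XX}(\omega) = 2\pi\sum_{n=-\infty}^{+\infty} \alpha_n\,\delta(\omega - n\omega_0), \tag{10.6-8}$$

which can be summarized in the following theorem. We note that the $\alpha_n$'s are thus necessarily
nonnegative.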


Theorem 10.6-2 If a WSS random process X(t) is m.s. periodic, then its psd is a
line spectrum with impulses at multiples of the fundamental frequency $\omega_0$. The impulse
areas are given by the Fourier coefficients of the periodic correlation function $R_{XX}(\tau)$.


Example 10.6-2 
(filtering m.s. periodic process) Let the input to an LSI system with frequency response
$H(\omega)$ be the periodic random process X(t) as indicated in Figure 10.6-5. In general we have

$$S_{YY}(\omega) = |H(\omega)|^2\,S_{XX}(\omega).$$

Using Equation 10.6-8, we get

$$S_{YY}(\omega) = 2\pi\sum_{n=-\infty}^{+\infty} \alpha_n\,|H(\omega_0 n)|^2\,\delta(\omega - n\omega_0).$$

Hence the output Y(t) is m.s. periodic with the same period T and has correlation function
given by

$$R_{YY}(\tau) = \sum_{n=-\infty}^{+\infty} \alpha_n\,|H(\omega_0 n)|^2\,e^{+j\omega_0 n\tau}.$$
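
The output correlation function of Example 10.6-2 can be evaluated directly from the line-spectrum weights. In the sketch below, the coefficients `alpha`, the period, and the first-order lowpass response `H` are assumed values chosen for illustration; only the formula for $R_{YY}(\tau)$ is taken from the example.

```python
import numpy as np

T = 2.0
w0 = 2.0 * np.pi / T
n = np.arange(-3, 4)
# Assumed input line weights: nonnegative and symmetric in n (real-valued input).
alpha = np.array([0.05, 0.10, 0.25, 0.20, 0.25, 0.10, 0.05])

def H(w):
    """Assumed first-order lowpass frequency response."""
    return 1.0 / (1.0 + 1j * w)

beta = alpha * np.abs(H(n * w0)) ** 2     # output line weights alpha_n |H(n w0)|^2

def R_yy(tau):
    """R_YY(tau) = sum_n alpha_n |H(n w0)|^2 exp(+j n w0 tau)."""
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    return (beta[None, :] * np.exp(1j * w0 * n[None, :] * tau[:, None])).sum(axis=1)

tau = np.linspace(0.0, T, 5)
print(np.round(R_yy(tau).real, 6))
```

The output weights stay nonnegative, the output power is $R_{YY}(0) = \sum_n \alpha_n|H(\omega_0 n)|^2$, and $R_{YY}$ repeats with the same period T as the input.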


Deterministic functions that are periodic can be represented by Fourier series as we have 
done for the periodic correlation function. We now show that the WSS periodic process itself 
may be represented as a Fourier series in the mean-square sense. 


Figure 10.6-5 Periodic random process input to LSI system.


Theorem 10.6-3 Let the WSS random process X(t) be m.s. periodic with period T.
Then we can write

$$X(t) = \sum_{n=-\infty}^{+\infty} A_n\,e^{+j\omega_0 nt}, \quad \text{(m.s.)}$$

where $\omega_0 = 2\pi/T$, with random Fourier coefficients

$$A_n \triangleq \frac{1}{T}\int_{-T/2}^{+T/2} X(\tau)\,e^{-j\omega_0 n\tau}\,d\tau, \quad \text{(m.s.)}$$

with mean

$$E[A_n] = \mu_X\,\delta[n],$$

and correlation

$$E[A_nA_m^*] = \alpha_n\,\delta[m - n],$$

and mean-square value

$$\alpha_n = \frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(\tau)\,e^{-j\omega_0 n\tau}\,d\tau. \tag{10.6-9}$$


Thus, the periodic random process can be expanded in a Fourier series whose coefficients
are statistically orthogonal. Hence the Fourier series is the Karhunen–Loève expansion (see
Section 10.5) for a WSS periodic process.


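Proof First we show that the $A_n$ are statistically orthogonal. We readily see that

$$E[A_n] = \frac{1}{T}\int_{-T/2}^{+T/2} E[X(u)]\,e^{-j\omega_0 nu}\,du = \mu_X\,\delta[n].$$

Then

$$E[A_n^*X(t)] = \frac{1}{T}\int_{-T/2}^{+T/2} E[X^*(u)X(t)]\,e^{+j\omega_0 nu}\,du$$
$$= \frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(t - u)\,e^{+j\omega_0 nu}\,du$$
$$= \left[\frac{1}{T}\int_{-T/2+t}^{T/2+t} R_{XX}(\tau)\,e^{-j\omega_0 n\tau}\,d\tau\right]e^{+j\omega_0 nt}$$
$$= \alpha_n\,e^{+j\omega_0 nt},$$

since the integrand is periodic in $\tau$ with period T and also by Equation 10.6-9. Next we
consider

$$E[A_kA_n^*] = \frac{1}{T}\int_{-T/2}^{+T/2} E[A_n^*X(u)]\,e^{-j\omega_0 ku}\,du$$
$$= \alpha_n\,\frac{1}{T}\int_{-T/2}^{+T/2} e^{+j\omega_0 nu}\,e^{-j\omega_0 ku}\,du$$
$$= \alpha_n\,\delta[n - k].$$

It remains to show that

$$E\left[\left|X(t) - \sum_{n=-\infty}^{+\infty} A_n\,e^{+j\omega_0 nt}\right|^2\right] = 0.$$

Expanding the left-hand side, we get

$$E[|X(t)|^2] - \sum_n E[A_n^*X(t)]\,e^{-j\omega_0 nt} - \sum_n E[A_nX^*(t)]\,e^{+j\omega_0 nt} + \sum_n\sum_k E[A_nA_k^*]\,e^{+j\omega_0(n-k)t}$$
$$= R_{XX}(0) - \sum_n \alpha_n - \sum_n \alpha_n^* + \sum_n \alpha_n.$$

But the $\alpha_n$ are real since $\alpha_n = E[|A_n|^2]$, and also $R_{XX}(0) = \sum_n \alpha_n$ so that we finally have

$$\sum_n \alpha_n - \sum_n \alpha_n - \sum_n \alpha_n + \sum_n \alpha_n = 0.$$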
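
The orthogonality of the Fourier coefficients in Theorem 10.6-3 can be illustrated by simulation. The sketch below assumes the random-phase sinusoid $X(t) = \cos(\omega_0 t + \Theta)$, for which $\alpha_{\pm 1} = 1/4$, computes $A_{+1}$ and $A_{-1}$ for many realizations with a DFT over one period, and estimates $E[|A_1|^2]$ and $E[A_1A_{-1}^*]$ by averaging. The process, the sample sizes, and the seed are assumptions for illustration, and the cross term is only approximately zero because it is a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
trials, samples = 20000, 64
t = np.arange(samples) / samples                  # one period, T = 1, w0 = 2 pi
theta = rng.uniform(0.0, 2.0 * np.pi, size=trials)
x = np.cos(2.0 * np.pi * t[None, :] + theta[:, None])   # X(t) = cos(w0 t + Theta)

# A_n = (1/T) * integral of X(tau) exp(-j w0 n tau) over one period, via the DFT.
A = np.fft.fft(x, axis=1) / samples
A_plus, A_minus = A[:, 1], A[:, -1]               # coefficients for n = +1 and n = -1

power_plus = np.mean(np.abs(A_plus) ** 2)         # estimate of alpha_1
cross = np.mean(A_plus * np.conj(A_minus))        # estimate of E[A_1 A_{-1}^*]
print(power_plus, abs(cross))
```

The first printed value is 1/4, matching $\alpha_1$, while the magnitude of the cross term is small and shrinks as the number of trials grows.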


As shown in Problem 10.40, analogous results hold for WSS periodic random sequences;
that is, $R_{XX}[m] = R_{XX}[m + T]$ for all m, where T is a positive integer.


Fourier Series for WSS Processes 


Of course, we can expand a general nonperiodic WSS random process in a Fourier series also. 
However, the corresponding expansion coefficients will not be statistically orthogonal. Yet in 
the limit as the length T of the expansion interval tends to infinity, the coefficients become 
asymptotically orthogonal (also asymptotically uncorrelated since zero-mean is assumed). 
Recalling the expansion and coefficient definition from Theorem 10.6-3, we can express the 
cross-correlation as 


$$E[A_nA_m^*] = \frac{1}{T^2}\int_{-T/2}^{+T/2}\int_{-T/2}^{+T/2} R_{XX}(t_1 - t_2)\,e^{-j\omega_0(nt_1 - mt_2)}\,dt_1\,dt_2.$$

Next make the substitution $t = t_1 + t_2$ and $\tau = t_1 - t_2$ with Jacobian $|J| = 2$ as in the
calculation of the psd in Section 9.5 to obtain

$$E[A_nA_m^*] = \frac{1}{T^2}\int_{-T}^{+T} R_{XX}(\tau)\exp\left(-j\frac{\omega_0}{2}(n+m)\tau\right)\left[\frac{1}{2}\int_{-(T-|\tau|)}^{+(T-|\tau|)} \exp\left(-j\frac{\omega_0}{2}(n-m)t\right)dt\right]d\tau$$
$$= \frac{1}{T^2}\int_{-T}^{+T} R_{XX}(\tau)\exp\left(-j\frac{\omega_0}{2}(n+m)\tau\right)\frac{\sin\left(\frac{\omega_0}{2}(n-m)(T-|\tau|)\right)}{\omega_0(n-m)/2}\,d\tau$$
$$= O\left(\frac{1}{T}\right) \to 0 \quad \text{as } T \to \infty, \quad n \neq m.$$

If we let n = m, we get the coefficient variance (also mean-square value) as

$$E[|A_n|^2] = \frac{1}{T}\int_{-T}^{+T}\left(1 - \frac{|\tau|}{T}\right)R_{XX}(\tau)\,e^{-j\omega_0 n\tau}\,d\tau, \quad \text{which converges to } \frac{1}{T}S_{XX}(n\omega_0) \text{ as } T \to \infty.$$

Related to this Fourier series expansion are the discrete-time Fourier and cosine transforms
used in audio and video data compression [10-7]. These block transforms,† like the
Fourier series for continuous expansion over a finite-length interval, asymptotically achieve
the statistical orthogonality property of the exact Karhunen–Loève transform, when the
random sequence is WSS.
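
The statement that $T\,E[|A_n|^2]$ approaches $S_{XX}(n\omega_0)$ for large T can be checked numerically. The sketch below assumes the real, even correlation function $R_{XX}(\tau) = e^{-|\tau|}$, whose psd is $2/(1+\omega^2)$, and evaluates the triangularly weighted integral over $(-T, +T)$ with a midpoint rule; the correlation function, the step size, and the function names are assumptions for illustration.

```python
import numpy as np

def corr(tau):
    """Assumed R_XX(tau) = exp(-|tau|); its psd is S_XX(w) = 2 / (1 + w^2)."""
    return np.exp(-np.abs(tau))

def psd(w):
    return 2.0 / (1.0 + w ** 2)

def scaled_coeff_power(n, T, step=1e-3):
    """T * E|A_n|^2 = integral over (-T, T) of (1 - |tau|/T) R_XX(tau) exp(-j w0 n tau)."""
    w0 = 2.0 * np.pi / T
    tau = np.arange(0.0, T, step) + 0.5 * step          # midpoints on (0, T)
    integrand = (1.0 - tau / T) * corr(tau) * np.cos(w0 * n * tau)
    return 2.0 * np.sum(integrand) * step                # even integrand, so double it

target_w = 2.0 * np.pi / 5.0
for T in (10.0, 50.0, 250.0):
    n = int(round(T / 5.0))          # keeps n * w0 fixed at 2 pi / 5 as T grows
    print(T, scaled_coeff_power(n, T), psd(target_w))
```

The frequency $n\omega_0$ is held fixed while T grows, and the gap between the scaled coefficient power and the psd shrinks roughly like 1/T.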

We still must show that the resulting expansion converges in the mean-square sense. 
Unlike the proof in the periodic case, we do not have the exact statistical orthogonality that 
was used in the proof of Theorem 10.6-3. Instead we proceed as follows: 


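$$E[|X - \hat{X}|^2] = E[(X - \hat{X})(X^* - \hat{X}^*)]$$

where

$$X(t) - \hat{X}(t) = X(t) - \frac{1}{T}\int_{-T/2}^{+T/2} X(\tau_1)\sum_{n_1=-N}^{+N} e^{+j\omega_0 n_1(t - \tau_1)}\,d\tau_1$$

and

$$X^*(t) - \hat{X}^*(t) = X^*(t) - \frac{1}{T}\int_{-T/2}^{+T/2} X^*(\tau_2)\sum_{n_2=-N}^{+N} e^{-j\omega_0 n_2(t - \tau_2)}\,d\tau_2.$$

Thus

$$E[(X(t_1) - \hat{X}(t_1))(X^*(t_2) - \hat{X}^*(t_2))]$$
$$= R_{XX}(t_1, t_2) - \frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(\tau_1, t_2)\left[\sum_{n_1=-N}^{+N} e^{+j\omega_0 n_1(t_1 - \tau_1)}\right]d\tau_1$$
$$- \frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(t_1, \tau_2)\left[\sum_{n_2=-N}^{+N} e^{-j\omega_0 n_2(t_2 - \tau_2)}\right]d\tau_2$$
$$+ \frac{1}{T^2}\int_{-T/2}^{+T/2}\int_{-T/2}^{+T/2} R_{XX}(\tau_1, \tau_2)\left[\sum_{n_1=-N}^{+N} e^{+j\omega_0 n_1(t_1 - \tau_1)}\right]\left[\sum_{n_2=-N}^{+N} e^{-j\omega_0 n_2(t_2 - \tau_2)}\right]d\tau_1\,d\tau_2.$$

Considering the third term,

$$\frac{1}{T}\int_{-T/2}^{+T/2} R_{XX}(t_1, \tau_2)\left[\sum_{n_2=-N}^{+N} e^{-j\omega_0 n_2(t_2 - \tau_2)}\right]d\tau_2,$$

the sum inside the square brackets tends to the impulse $T\delta(t_2 - \tau_2)$ as $N \to \infty$. Thus,
overall this term becomes asymptotically equal to $R_{XX}(t_1, t_2)$. The same is true for the
second term. The last term becomes

$$\frac{1}{T^2}\int_{-T/2}^{+T/2}\int_{-T/2}^{+T/2} R_{XX}(\tau_1, \tau_2)\left[\sum_{n_1=-N}^{+N} e^{+j\omega_0 n_1(t_1 - \tau_1)}\right]\left[\sum_{n_2=-N}^{+N} e^{-j\omega_0 n_2(t_2 - \tau_2)}\right]d\tau_1\,d\tau_2$$
$$\to \frac{1}{T^2}\int_{-T/2}^{+T/2}\int_{-T/2}^{+T/2} R_{XX}(\tau_1, \tau_2)\,T\delta(t_1 - \tau_1)\,T\delta(t_2 - \tau_2)\,d\tau_1\,d\tau_2$$
$$= R_{XX}(t_1, t_2).$$

Since each of the four terms asymptotically approaches $R_{XX}(t_1, t_2)$ in magnitude, and they enter
with signs +, −, −, +, the mean-square value of $X - \hat{X}$ is zero, as was to be shown.

†By block transform, we mean a transform of a finite block of data rather than the entire waveform.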


SUMMARY 


We have furnished extensions of important ideas from ordinary calculus such as continuity, 
differentiation, integration, and resolution to random processes. These extensions are known 
as the mean-square calculus. This enabled us to define useful differential-equation and inte- 
gral operators for a wide class of second-order random processes. In so doing, we derived 
further results on m.s. convergence, including the notion that random variables with finite 
mean-square value, that is, second-order random variables, can be viewed as vectors in a 
Hilbert space with the expectation of their conjugate product serving as an inner product. 
This viewpoint will be important for applications to linear estimation in Chapter 11. 

We defined the m.s. stochastic integral and applied it to two problems: ergodicity, the
problem of estimating the parameter functions of random processes, and the Karhunen–Loève
expansion, an important theoretical tool that decomposes a possibly nonstationary
process into a sum of products of orthogonal random variables and orthonormal basis
functions.


APPENDIX: INTEGRAL EQUATIONS 


In this appendix we look at some of the properties of the solution of integral equations that
are needed to appreciate the Karhunen–Loève expansion presented in Section 10.5. In Facts
1 to 3 that follow, we develop a method to solve for a complete set of orthonormal solutions
$\{\lambda_n, \phi_n(t)\}$. Consider the integral equation,

$$\int_a^b R(t, s)\,\phi(s)\,ds = \lambda\,\phi(t) \quad \text{on } a \le t \le b, \tag{10.A-1}$$

where the kernel R(t, s) is continuous, Hermitian, and positive semidefinite. Thus, R fulfills
the conditions to be a correlation function, although in the integral equation setting there
may be no such interpretation. The solution to Equation 10.A-1 consists of the function
$\phi(t)$, called an eigenfunction, and the scalar $\lambda$, called the corresponding eigenvalue. To avoid
the trivial case, we rule out the solution $\phi(t) = 0$ as a valid eigenfunction.

A fundamental theorem concerns the existence of solutions to Equation 10.A-1. Its proof
can be found in the book on functional analysis by F. Riesz and B. Sz.-Nagy [10-8].


Existence Theorem


If the continuous and Hermitian kernel R(t, s) in the integral Equation 10.A-1 is nonzero and
positive semidefinite, then there exists at least one eigenfunction with nonzero eigenvalue.


While the proof of this theorem is omitted, it should seem reasonable based on our expe- 
rience with computing the eigenvalues and eigenvectors of covariance matrices in Section 5.4. 
To see the connection, note that the integral in Equation 10.A-1 is the limit of a sum 
involving samples in s. If we require its solution only at samples in t, then we have equivalently
the vector-matrix eigenvector problem of Section 5.4. So with some form of continuity
in R and $\phi$, the properties of the eigenvectors and eigenvalues should carry over to the
present eigenfunctions and eigenvalues.

The existence theorem allows us to conclude several useful results concerning the solutions
of Equation 10.A-1, culminating in Mercer's theorem, which is applied in Section 10.5
to prove the Karhunen–Loève representation. This method is adapted from [10-2].
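
The connection between Equation 10.A-1 and the matrix eigenvector problem can be made concrete by discretizing the integral. The sketch below assumes the kernel $R(t, s) = \min(t, s)$ on [0, 1] (the correlation function of a unit-rate Wiener process), replaces the integral by a midpoint Riemann sum, and compares the leading matrix eigenvalues with the known eigenvalues $1/((n - \tfrac{1}{2})\pi)^2$ of this kernel. The kernel choice and grid size are assumptions for illustration.

```python
import numpy as np

N = 400
h = 1.0 / N
t = (np.arange(N) + 0.5) * h                  # midpoints of [0, 1]
R = np.minimum(t[:, None], t[None, :])        # assumed kernel R(t, s) = min(t, s)

# Replacing the integral by a Riemann sum turns 10.A-1 into (h R) phi = lambda phi.
eigvals, eigvecs = np.linalg.eigh(h * R)
eigvals = eigvals[::-1]                       # largest eigenvalue first
phi = eigvecs[:, ::-1] / np.sqrt(h)           # scaled so that sum |phi|^2 h = 1

exact = 1.0 / ((np.arange(1, 4) - 0.5) * np.pi) ** 2
print(eigvals[:3])
print(exact)
```

All discrete eigenvalues come out real and nonnegative, and their sum equals the Riemann sum of R(t, t), which is 1/2 here.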


Fact 1. All the eigenvalues must be real and nonnegative. Additionally, when the kernel
R(t, s) is positive definite over the interval (a, b), they must all be positive.


Proof Since R is positive semidefinite,

$$\int_a^b\int_a^b \phi^*(t)\,R(t, s)\,\phi(s)\,dt\,ds \ge 0,$$

but

$$\int_a^b \phi^*(t)\left[\int_a^b R(t, s)\,\phi(s)\,ds\right]dt = \int_a^b \phi^*(t)\,\lambda\,\phi(t)\,dt = \lambda\int_a^b |\phi(t)|^2\,dt.$$

Thus, $\lambda$ is real and nonnegative since $\int_a^b |\phi(t)|^2\,dt \neq 0$. Also if R is additionally positive
definite, then $\lambda$ cannot be zero. ∎


Fact 2. Let R(t, s) be Hermitian and positive semidefinite. Let $\phi_1$, $\lambda_1$ be a corresponding
eigenfunction and eigenvalue pair. Then

$$R_1(t, s) \triangleq R(t, s) - \lambda_1\phi_1(t)\phi_1^*(s)$$

is also positive semidefinite.


Fact 3. Either $R_1(t, s) = 0$ for all t and s or else there is another eigenfunction and
eigenvalue $\phi_2$, $\lambda_2$ with† $\phi_2 \perp \phi_1$ such that

$$R_2(t, s) \triangleq R_1(t, s) - \lambda_2\phi_2(t)\phi_2^*(s)$$

is positive semidefinite.

†By $\phi_2 \perp \phi_1$ we mean $\int_a^b \phi_1(t)\phi_2^*(t)\,dt = 0$, so that upon normalization, assuming they are non-zero, $\phi_1$
and $\phi_2$ become orthonormal (Equation 10.5-5).

Continuing with this procedure, we eventually obtain

$$R_N(t, s) \triangleq R(t, s) - \sum_{n=1}^{N} \lambda_n\phi_n(t)\phi_n^*(s).$$

Now since

$$R_N(t, t) = R(t, t) - \sum_{n=1}^{N} \lambda_n|\phi_n(t)|^2 \ge 0,$$

we have that the increasing sum $\sum_{n=1}^{N} \lambda_n|\phi_n(t)|^2$ is bounded from above so that it must
tend to a limit. Thus, $R_\infty(t, s)$ exists for s = t. For $s \neq t$ consider the partial sum for m > n,

$$\Delta R_{m,n}(t, s) \triangleq \sum_{k=n+1}^{m} \lambda_k\phi_k(t)\phi_k^*(s).$$

Then $|\Delta R_{m,n}(t, t)| \to 0$ as $m, n \to \infty$. Using the Schwarz inequality we conclude‡

$$|\Delta R_{m,n}(t, s)| \le \sqrt{\Delta R_{m,n}(t, t)\,\Delta R_{m,n}(s, s)}.$$

Thus, $|\Delta R_{m,n}(t, s)| \to 0$ as $m, n \to \infty$. Therefore, we can write

$$R_\infty(t, s) = R(t, s) - \sum_{n=1}^{\infty} \lambda_n\phi_n(t)\phi_n^*(s). \tag{10.A-2}$$

Mercer's theorem now can be seen as an affirmative answer to the question of whether or
not $R_\infty(t, s) = 0$. We now turn to the proofs of Facts 2 and 3 prior to proving Mercer's
theorem.
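
Facts 2 and 3 describe a deflation step, and its finite-dimensional analog is easy to check. The sketch below uses a randomly generated Hermitian positive semidefinite matrix as a stand-in for a sampled kernel (an assumption for illustration), removes the leading eigenpair, and verifies that the remainder is still positive semidefinite and that its leading eigenvector is orthogonal to the one removed.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
R = G @ G.conj().T                     # assumed Hermitian, positive semidefinite "kernel"

lam, phi = np.linalg.eigh(R)           # ascending eigenvalues, orthonormal eigenvectors
lam1, phi1 = lam[-1], phi[:, -1]

# Fact 2: R1 = R - lam1 phi1 phi1^H is again positive semidefinite.
R1 = R - lam1 * np.outer(phi1, phi1.conj())
lam_R1, phi_R1 = np.linalg.eigh(R1)

# Fact 3: the leading eigenvector of R1 is orthogonal to phi1.
lam2, phi2 = lam_R1[-1], phi_R1[:, -1]
print(lam_R1.min(), abs(np.vdot(phi1, phi2)))
```

The smallest eigenvalue of the deflated matrix is zero up to rounding, and its largest eigenvalue coincides with the second-largest eigenvalue of the original matrix.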


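Proof of Fact 2 We must show that $R_1$ is positive semidefinite over (a, b). We do
this by defining the random process

$$Y(t) \triangleq X(t) - \phi_1(t)\int_a^b X(\tau)\phi_1^*(\tau)\,d\tau$$

and showing that $R_1$ is its correlation function. We have

$$E[Y(t)Y^*(s)] = E[X(t)X^*(s)] - \phi_1^*(s)\int_a^b E[X^*(\tau)X(t)]\,\phi_1(\tau)\,d\tau$$
$$- \phi_1(t)\int_a^b E[X(\tau)X^*(s)]\,\phi_1^*(\tau)\,d\tau$$
$$+ \phi_1(t)\phi_1^*(s)\int_a^b\int_a^b E[X(\tau_1)X^*(\tau_2)]\,\phi_1^*(\tau_1)\phi_1(\tau_2)\,d\tau_1\,d\tau_2$$
$$= R_{XX}(t, s) - \phi_1^*(s)\int_a^b R_{XX}(t, \tau)\,\phi_1(\tau)\,d\tau$$
$$- \phi_1(t)\int_a^b R_{XX}(\tau, s)\,\phi_1^*(\tau)\,d\tau$$
$$+ \phi_1(t)\phi_1^*(s)\int_a^b\int_a^b R_{XX}(\tau_1, \tau_2)\,\phi_1^*(\tau_1)\phi_1(\tau_2)\,d\tau_1\,d\tau_2,$$

but $\lambda_1\phi_1(t) = \int_a^b R_{XX}(t, \tau)\phi_1(\tau)\,d\tau$ and using $R_{XX}^*(t, \tau) = R_{XX}(\tau, t)$ we get

$$\lambda_1\phi_1^*(s) = \int_a^b R_{XX}(\tau, s)\,\phi_1^*(\tau)\,d\tau$$

so that the two cross-terms are each equal to $-\lambda_1\phi_1(t)\phi_1^*(s)$. Evaluating

$$\int_a^b\int_a^b R_{XX}(\tau_1, \tau_2)\,\phi_1^*(\tau_1)\phi_1(\tau_2)\,d\tau_1\,d\tau_2 = \int_a^b \phi_1^*(\tau_1)\left[\int_a^b R_{XX}(\tau_1, \tau_2)\,\phi_1(\tau_2)\,d\tau_2\right]d\tau_1$$
$$= \int_a^b \phi_1^*(\tau_1)\,\lambda_1\phi_1(\tau_1)\,d\tau_1$$
$$= \lambda_1\int_a^b |\phi_1(\tau)|^2\,d\tau$$
$$= \lambda_1,$$

and combining, we get

$$R_{YY}(t, s) = R_{XX}(t, s) - \lambda_1\phi_1(t)\phi_1^*(s),$$

which agrees with the definition of $R_1$ in Fact 2 with $R_{XX} = R$. Since every correlation
function is positive semidefinite, so is $R_1$.

‡We can regard $\Delta R_{m,n}(t, s)$ as the correlation of two RV's with variance $\Delta R_{m,n}(t, t)$ and $\Delta R_{m,n}(s, s)$.
Alternatively we can use the Schwarz inequality for complex numbers: $|\sum a_ib_i|^2 \le (\sum|a_i|^2)(\sum|b_i|^2)$.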
Proof of Fact 3 Just repeat the proof of Fact 2, with $R_1$ in place of R to conclude
that $R_2$ is positive semidefinite. To show that $\phi_2 \perp \phi_1$, we proceed as follows. We first note
that

$$\lambda_2\phi_2(t) = \int_a^b R_1(t, s)\,\phi_2(s)\,ds$$

and then plug in the definition of $R_1$ to obtain

$$\lambda_2\phi_2(t) = \int_a^b R(t, s)\,\phi_2(s)\,ds - \lambda_1\phi_1(t)\int_a^b \phi_1^*(s)\phi_2(s)\,ds.$$

Then we multiply by $\phi_1^*(t)$ and integrate over (a, b) to obtain

$$\lambda_2\int_a^b \phi_1^*(t)\phi_2(t)\,dt = \int_a^b\int_a^b \phi_1^*(t)\phi_2(s)\,R(t, s)\,dt\,ds - \lambda_1\int_a^b |\phi_1(t)|^2\,dt\int_a^b \phi_1^*(s)\phi_2(s)\,ds$$
$$= \int_a^b \phi_2(s)\left[\int_a^b R(t, s)\,\phi_1^*(t)\,dt\right]ds - \lambda_1\int_a^b \phi_1^*(s)\phi_2(s)\,ds$$
$$= \int_a^b \phi_2(s)\,\lambda_1^*\phi_1^*(s)\,ds - \lambda_1\int_a^b \phi_1^*(s)\phi_2(s)\,ds;$$

by the Hermitian symmetry of R, that is, $R(t, s) = R^*(s, t)$, we thus obtain

$$\lambda_2\int_a^b \phi_1^*(t)\phi_2(t)\,dt = (\lambda_1^* - \lambda_1)\int_a^b \phi_1^*(t)\phi_2(t)\,dt.$$

Now $\lambda_1$ is real so

$$\lambda_2\int_a^b \phi_1^*(t)\phi_2(t)\,dt = 0.$$

Thus, either $\lambda_2 = 0$ or $\phi_2 \perp \phi_1$. We can reject the first possibility by the existence theorem
and hence we are done.


Mercer's Theorem Let the kernel R(t, s) be continuous, Hermitian, and positive
semidefinite. Let $\{\lambda_n, \phi_n(t)\}$ be the (possibly infinite) complete set of orthonormal solutions
to the integral Equation 10.A-1. Then the following expansion holds for all $t, s \in [a, b]$,

$$R(t, s) = \sum_{n=1}^{\infty} \lambda_n\phi_n(t)\phi_n^*(s).$$

Proof By Equation 10.A-2 we know that the question reduces to whether the positive
semidefinite kernel $R_\infty$ is equal to zero. If it is not zero, then by the existence theorem
there is an eigenfunction and nonzero eigenvalue $\lambda$ for $R_\infty$. Since $\lambda > 0$, adding this new
eigenfunction-eigenvalue pair to the right-hand side of Equation 10.A-2, we get a change
in the value of $R_\infty$, which contradicts the assumed convergence. Thus, $R_\infty = 0$ and the
theorem is proved.
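
Mercer's expansion can be watched converging for a kernel whose eigenpairs are known in closed form. The sketch below assumes the kernel $R(t, s) = \min(t, s)$ on [0, 1], for which $\lambda_n = 1/((n - \tfrac{1}{2})\pi)^2$ and $\phi_n(t) = \sqrt{2}\sin((n - \tfrac{1}{2})\pi t)$, and evaluates partial sums of the series at one pair of points. The kernel, the evaluation points, and the function name are assumptions for illustration.

```python
import numpy as np

def mercer_partial_sum(t, s, terms):
    """Partial sum of lambda_n phi_n(t) phi_n(s) for the kernel min(t, s) on [0, 1]."""
    n = np.arange(1, terms + 1)
    freq = (n - 0.5) * np.pi
    lam = 1.0 / freq ** 2                         # eigenvalues
    phi_t = np.sqrt(2.0) * np.sin(freq * t)       # orthonormal eigenfunctions at t
    phi_s = np.sqrt(2.0) * np.sin(freq * s)       # real-valued, so no conjugate needed
    return np.sum(lam * phi_t * phi_s)

t, s = 0.3, 0.8
for terms in (5, 50, 5000):
    print(terms, mercer_partial_sum(t, s, terms), min(t, s))
```

The partial sums approach min(t, s) as more terms are kept, with a truncation error that is bounded by the tail of $\sum_n 2\lambda_n$.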


PROBLEMS 
(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


10.1 Use Theorems 10.1-5 and 10.1-6 to show the following properties of the m.s.
derivative.

(a) $\dfrac{d}{dt}\left(aX_1(t) + bX_2(t)\right) = a\,\dfrac{dX_1(t)}{dt} + b\,\dfrac{dX_2(t)}{dt}$.

(b) $E[X_1(t_1)X_2^{\prime *}(t_2)] = \dfrac{\partial}{\partial t_2}R_{X_1X_2}(t_1, t_2)$.

10.2 Let X(t) be a random process with constant mean $\mu_X$ ($\neq 0$) and covariance function

$$K_{XX}(t, s) = \sigma^2\cos\omega_0(t - s).$$

(a) Show that the m.s. derivative X'(t) exists here.
(b) Find the covariance function of the m.s. derivative $K_{X'X'}(t, s)$.
(c) Find the correlation function of the m.s. derivative $R_{X'X'}(t, s)$.

10.3 Let the random process X(t) be WSS with correlation function

$$R_{XX}(\tau) = \sigma^2e^{-(\tau/T)^2}.$$

Let Y(t) = 3X(t) + 2X'(t), where the derivative is interpreted in the m.s. sense.

(a) State conditions for the m.s. existence of such a Y(t) in terms of a general
correlation function $R_{XX}(\tau)$.

(b) Find the correlation function $R_{YY}(\tau)$ for the given $R_{XX}(\tau)$ in terms of $\sigma^2$
and T.


10.4 Let X(t) be a stationary random process with mean $\mu_X$ and covariance function


2 
ox 


=F hg lee -—o <T< +00. 


Kxx(r) 


(a) Show that an m.s. derivative exists for all t. 
(b) Find $\mu_{X'}(t)$ and $K_{X'X'}(\tau)$ for all t and $\tau$.
*10.5 Carefully show from basic definitions that the mean-square integral is linear (in the 
integrand). That is, for a fixed interval 0 < t < T, we wish to conclude that the 
following equality must hold in the mean-square sense: 


$$\int_0^T \left(a_1X_1(t) + a_2X_2(t)\right)dt = a_1\int_0^T X_1(t)\,dt + a_2\int_0^T X_2(t)\,dt.$$

Assume that each individual integral on the right-hand side exists in the mean-square
sense and that the complex constants $a_i$ are finite. (Hint: Use the triangle
inequality for the norm $\|X\| \triangleq \sqrt{E[|X|^2]}$, that is, $\|X + Y\| \le \|X\| + \|Y\|$.)

*10.6 Show from the basic definitions that the mean-square integral is linear in its upper 
and lower limits of integration. That is, for an arbitrary second-order random 
process X(t), show the following mean-square equality: 


$$\int_{t_1}^{t_3} X(t)\,dt = \int_{t_1}^{t_2} X(t)\,dt + \int_{t_2}^{t_3} X(t)\,dt,$$

for all $-\infty < t_1 < t_2 < t_3 < +\infty$. (Hint: Since second-order means $E[|X(t)|^2] < \infty$,
we already know that each of the above m.s. integrals exists, because it must be that

$$\int_{t_i}^{t_j}\int_{t_i}^{t_j} R_{XX}(t, s)\,dt\,ds < \infty,$$

for any finite $t_i < t_j$. You may want to make use of the triangle inequality.)
10.7 To estimate the mean of a stationary random process, we often consider an integral 
average 


1 T 
eae al X(t)dt, T>O0. 


(a) Find the mean of I(T) as a function of T, denoted $\mu_I(T)$ for T > 0, as a
function of the unknown mean $\mu_X$.

(b) Find the variance of I(T), denoted $\sigma_I^2(T)$ for T > 0, as a function of the
unknown covariance function $K_{XX}(\tau)$.


*10.8 Let X(t) be a WSS, zero-mean Gaussian random process. Show
that the m.s. derivative of $Y(t) \triangleq X^2(t)$ is $\dot{Y}(t) = 2X(t)\dot{X}(t)$ and find the correlation
function of $\dot{Y}$ in terms of $R_{XX}$ and its derivatives. (Hint: Recall that for
jointly Gaussian random variables $E[X_1X_2X_3X_4] = R_{12}R_{34} + R_{13}R_{24} + R_{14}R_{23}$.)

10.9 Let $Y(t) = \cos(2\pi f_0t)\cdot X(t)$, where X(t) is a second-order random process possessing
an m.s. derivative X'(t) at each time t. Show that the product rule for derivatives
is true for m.s. derivatives in this case. In particular we have the m.s. equality,

$$Y'(t) = -2\pi f_0\sin(2\pi f_0t)\,X(t) + \cos(2\pi f_0t)\,X'(t).$$

[Hint: Use the triangle inequality applied to the error for an approximate derivative
calculated as a difference for times t and $t + \frac{1}{n}$. Then add and subtract $X(t + \frac{1}{n})$
inside. Letting $n \to \infty$ should yield the desired result.]
10.10 This problem concerns a property of the correlation function of a general indepen- 
dent increments random process. 
(a) If U(t) is an independent increments random process, with zero-mean, show
that $R_{UU}(t_1, t_2) = f(\min(t_1, t_2))$ where $f(t) \triangleq E[U^2(t)]$.
(b) Using the definition of an independent process (see Section 9.4), show that 
the generalized m.s. derivative U’(t) is an independent process. 
(c) What is the condition on the function f(t) such that the random process 
U'(t) would be wide-sense stationary? 


10.11 Consider the running integral

$$Y(t) = \int_0^t a(t, \tau)X(\tau)\,d\tau, \quad \text{for } t \ge 0.$$

(a) State conditions for the m.s. existence of this integral in terms of the kernel
function $a(t, \tau)$ and the correlation function of X(t).

(b) Find the mean function $\mu_Y(t)$ in terms of the kernel function $a(t, \tau)$ and the
mean function $\mu_X(t)$.

(c) Find the covariance function $K_{YY}(t, s)$ in terms of the kernel function $a(t, \tau)$
and the covariance function $K_{XX}(t, s)$.

10.12 This problem concerns the m.s. derivative. Let the random process X(t) be second
order, that is, $E[|X(t)|^2] < \infty$, and with correlation function $R_{XX}(t_1, t_2)$. Let the
random process Y(t) be defined by the m.s. integral


(a) State the condition for the existence of the m.s. integral Y(t) for any given
t in terms of $R_{XX}(t_1, t_2)$.

(b) Find the correlation function $R_{YY}(t_1, t_2)$ of Y(t) in terms of $R_{XX}(t_1, t_2)$.

(c) Consider the m.s. derivative dY(t)/dt and find the condition on $R_{XX}(t_1, t_2)$
for its existence.

*10.13 (a) Let X[n] be a sequence of Gaussian random variables. Let Y be the limit of 
this sequence where we assume Y exists in the mean-square sense. Use the 
fact that convergence in mean square implies convergence in distribution to 
conclude that Y is also Gaussian distributed. 

(b) Repeat the above argument for a sequence of Gaussian random vectors
$\mathbf{X}[n] = [X_1[n], \ldots, X_K[n]]^T$. (Note: Mean-square convergence for random
vectors means

$$E[|\mathbf{X}[n] - \mathbf{X}|^2] \to 0$$

where $|\mathbf{X}|^2 \triangleq \sum_{i=1}^{K} X_i^2$, and Chebyshev's inequality for random vectors is

$$P[|\mathbf{X}| > \varepsilon] \le \int \frac{|\mathbf{x}|^2}{\varepsilon^2}\,f(\mathbf{x})\,d\mathbf{x} = \frac{E[|\mathbf{X}|^2]}{\varepsilon^2}.)$$

(c) Let $\mathbf{X}[n] \triangleq [X_n(t_1), \ldots, X_n(t_K)]^T$ and use the result of part (b) to conclude
that an m.s. limit of Gaussian random processes is another Gaussian random
process.

10.14 Let X(t) be a Gaussian random process on some time interval. Use the result of
the last problem to show that the m.s. derivative random process X'(t) must also
be a Gaussian random process.

10.15 Let the stationary random process X(t) be bandlimited to $[-\omega_c, +\omega_c]$ where $\omega_c$ is
a positive real number. Then define Y(t) as the output of an ideal lowpass filter
with passband $[-\omega_1, +\omega_1]$ and input random process X(t).

(a) Show that X(t) = Y(t) in the m.s. sense if $\omega_1 > \omega_c$.
(b) In the above case, also show that X(t) = Y(t) with probability-1.

10.16 Consider the following mean-square differential equation,

$$\dot{X}(t) + 3X(t) = U(t), \quad t \ge t_0,$$

driven by a WSS random process U(t) with psd

$$S_{UU}(\omega) = \frac{1}{\omega^2 + 4}.$$

The differential equation is subject to the initial condition $X(t_0) = X_0$, where
the random variable $X_0$ has zero-mean, variance 5, and is orthogonal to the input
random process U(t).

(a) As a preliminary step, express the deterministic solution to the above differential
equation, now regarded as an ordinary differential equation with deterministic
input u(t) and initial condition $x_0$, not a random variable. Write your
solution as the sum of a zero-input part and a zero-state part.

(b) Now returning to the m.s. differential equation, write the solution random
process X(t) as a mean-square convolution integral of the input process U(t)
over the time interval $(t_0, t)$ plus a zero-input term due to the random initial
condition $X_0$. Justify the mean-square existence of the terms in your solution.

(c) Write the integral expression for the two-parameter output correlation function
$R_{XX}(t, s)$ over the time intervals $t, s \ge t_0$. You do not have to evaluate
the integral.


10.17 Consider the m.s. differential equation 


dY (t) 7 
Gt YO) =X) 


for t > 0 subject to the initial condition Y(0) = 0. The input is 
X(t) = 5cos 2t + W(t), 


where W(t) is a Gaussian white noise with mean zero and covariance function
$K_{WW}(\tau) = \sigma^2\delta(\tau)$.

(a) Find $\mu_Y(t)$ for t > 0.

(b) Find the covariance $K_{YY}(t_1, t_2)$ for $t_1$ and $t_2 > 0$.

(c) What is the maximum value of $\sigma$ such that

$$P[|Y(t) - \mu_Y(t)| < 0.1] > 0.99 \quad \text{for all } t > 0?$$

Use Table 2.4-1 for erf(·) in Chapter 2.

$$\left(\mathrm{erf}(x) \triangleq \frac{1}{\sqrt{2\pi}}\int_0^x e^{-\frac{1}{2}u^2}\,du, \quad x \ge 0\right).$$

10.18 Show that the m.s. solution to

$$\dot{Y}(t) + aY(t) = X(t),$$

that is, the random process Y(t) having $R_{XY}(t_1, t_2)$ and $R_{YY}(t_1, t_2)$ of
Equations 10.3-3 and 10.3-4, actually satisfies the differential equation in the mean-square
sense. More specifically, you are to show in the general complex-valued case
that

$$E\left[\left|\dot{Y}(t) + aY(t) - X(t)\right|^2\right] = 0.$$


10.19 Find the generalized m.s. differential equation that when driven by a standard
white noise process W(t), that is, mean 0 and correlation function $\delta(\tau)$, will yield
the solution random process X(t) with psd

$$S_{XX}(\omega) = \frac{1}{(1 + \omega^2)^2}.$$

(Note: We want the causal solution here, so differential equation poles should be
all in the left half of the complex s-plane.)

10.20 Consider the following generalized mean-square differential equation driven by a
white noise process W(t) with correlation function $R_{WW}(\tau) = \delta(\tau)$:

$$\frac{dX(t)}{dt} + 3X(t) = \frac{dW(t)}{dt} + 2W(t), \quad t \ge t_0,$$

subject to the initial condition $X(t_0) = X_0$, where $X_0$ is a zero-mean random
variable orthogonal to the white noise process W(t).

(a) Write the solution process X(t) as an expression involving a generalized
mean-square integral of the white noise W(t) over the interval $(t_0, t)$.

(b) If $t_0 = -\infty$, then X(t) becomes wide-sense stationary. Find the autocorrelation
function $R_{XX}(\tau)$ of this WSS process.

(c) Is X(t) a Markov random process? Justify your answer.


10.21 In Figure P10.21, a WSS random signal X(t) is corrupted by additive WSS noise
N(t) prior to entering the LTI system with impulse response h(t). The output of
the system is intended as an estimate of the noise-free random signal X(t) and
is denoted $\hat{X}(t)$. Here X and N are mutually orthogonal random processes. The
frequency response $H(\omega)$ is specified as

$$H(\omega) = \frac{S_{XX}(\omega)}{S_{XX}(\omega) + S_{NN}(\omega)},$$

where $S_{XX}$ and $S_{NN}$ are the corresponding psd's.

[Figure P10.21 Linear estimator of X(t) from signal plus noise observations: the sum X(t) + N(t) enters the system h(t), whose output is $\hat{X}(t)$.]

(a) In terms of h, $R_{XX}$, and $R_{NN}$, write the integral expressing the condition
for mean-square existence of the estimate $\hat{X}(t)$.
(b) Use psd's and the given frequency response $H(\omega)$ to find a simpler frequency-
domain version of this condition.
(c) Show that the condition in (b) is always satisfied for a WSS second-order
random process X(t).

10.22 To detect a constant signal of amplitude A in white Gaussian noise of variance $\sigma^2$
and mean zero, we consider two hypotheses (i.e., events):

$$\left.\begin{aligned} H_0 &: R(t) = W(t) \\ H_1 &: R(t) = A + W(t) \end{aligned}\right\} \quad \text{for } t \in [0, T].$$

It can be shown that the optimal detector, to decide on the correct hypothesis, first
computes the integral

$$\Lambda \triangleq \int_0^T R(t)\,dt,$$

and then performs a threshold test.
(a) Find the mean value of the integral $\Lambda$ under each hypothesis, $H_0$ and $H_1$.
(b) Find the variance of $\Lambda$ under each hypothesis.
(c) An optimal detector would compare $\Lambda$ to the threshold $\Lambda_0 \triangleq AT/2$ in the
case when each hypothesis is equally likely, that is, $P[H_0] = P[H_1] = 1/2$.
Under these conditions, find

$$P[\Lambda \ge \Lambda_0 \,|\, H_0],$$

expressing your result in terms of the error function erf(·) defined in Chapter 2.
[Note: By Problem 10.13(c), $\Lambda$ is a Gaussian random variable.]

10.23 This problem concerns ergodicity for random processes.

(a) State the general definition of “ergodic in the mean” for a wide-sense stationary
process X(t).

(b) Let X(t) be a wide-sense stationary Gaussian random process with zero mean
and correlation function

$$R_{XX}(\tau) = \sigma^2 e^{-\alpha|\tau|} \cos 2\pi f\tau,$$

where $\sigma^2$, $\alpha$, and $f$ are all positive constants. Show that X(t) is ergodic in
the mean. (Hint: You may want to use a sufficient condition for ergodicity.)

10.24 Assume that a WSS random process X(t) is ergodic in the mean and that the
following limit exists: $\lim_{|\tau| \to \infty} K_{XX}(\tau)$. Show that the correlation function $R_{XX}(\tau)$
then has the asymptotic value $\lim_{|\tau| \to \infty} R_{XX}(\tau) = |\mu_X|^2$ by making use of results
in this chapter.

10.25 Let the random sequence X[n] be stationary over the range $0 \le n < \infty$. Define the
time average


Analogously to the concept of “ergodic in the mean” for random processes, we have
the following:

Definition. A stationary random sequence X[n] is ergodic in the mean if M[N]
converges to the ensemble mean $\mu_X$ in the mean-square sense as $N \to \infty$.

(a) Find a suitable condition on the covariance function $K_{XX}[m]$ for X[n] to be
ergodic in the mean.
(b) Show that this condition can be put in the form

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=-N}^{+N} \left(1 - \frac{|n|}{N}\right) K_{XX}[n] = 0.$$
(c) Is the stationary random sequence with covariance
$K_{XX}[m] = 5(0.9)^{|m|} + 15(0.8)^{|m|}$
ergodic in the mean?
10.26 Let X(t) be a random process with constant mean $\mu_X$ and covariance function
$K_{XX}(t, s) = \sigma^2 \cos \omega_0 (t - s)$. Does this process have an orthogonal Fourier series?
Why? Over which intervals?

10.27 If the K-L integral Equation 10.5-3 has two solutions, $\phi_1(t)$ and $\phi_2(t)$, corresponding
to the eigenvalues $\lambda_1$ and $\lambda_2$, then show that if $\lambda_1 \ne 0$ and $\lambda_2 \ne \lambda_1$, we must have

$$\int_{-T/2}^{T/2} \phi_1(t)\phi_2^*(t)\,dt = 0.$$

[Hint: Substitute for $\phi_1(t)$ in the above expression and use the Hermitian symmetry
of K(t, s), that is, $K(t, s) = K^*(s, t)$.]

*10.28 This problem concerns the convergence of a commonly seen infinite-series estimate
for signals observed in additive noise. We have the observation consisting of the
second-order and zero-mean random process X(t) over the interval [0, T], where

$$X(t) = S(t) + N(t), \quad \text{with } S \perp N,$$

and we call S the “signal” and N the “noise.” Consider the K-L integral equation
for the observed process X(t),

$$\int_0^T R_{XX}(t, s)\phi_n(s)\,ds = \lambda_n \phi_n(t), \quad \text{for } 0 \le t \le T.$$

Denote the following three (infinite) sets of random coefficients, for $n \ge 1$:

$$X_n \triangleq \int_0^T X(t)\phi_n^*(t)\,dt, \quad S_n \triangleq \int_0^T S(t)\phi_n^*(t)\,dt, \quad N_n \triangleq \int_0^T N(t)\phi_n^*(t)\,dt.$$


Further, denote by ot) the three sets of variance terms for these three sets of 


random coefficients.‘ We now want to make a linear estimate of the signal S(t) 
from the coefficients X,, as follows: 


~ @B 
i) = = Xn@, (4), 


and based indirectly, through the $X_n$, on the observed random process X(t). The
problem can now be stated as proving that this infinite sum converges in the mean-square
sense. You should do this using Mercer's theorem:

$$R_{XX}(t, s) = \sum_{n=1}^{\infty} \lambda_n \phi_n(t)\phi_n^*(s),$$

and the fact that second order means that $R_{XX}(t, t) < \infty$. Show Cauchy convergence
of the partial sums over n.

10.29 In this problem we use the K-L expansion to get LMMSE estimates of a Gaussian
signal in white Gaussian noise. We observe X(t) on [0, T] where

$$X(t) = S(t) + W(t), \quad \text{with } S \perp W.$$

Here W(t) is zero-mean and has covariance $K_{WW}(\tau) = \sigma_W^2\delta(\tau)$, and S(t) is zero-mean
with covariance $K_{SS}(t_1, t_2)$.
(a) Show that any set of orthonormal functions $\{\phi_n(t)\}$ satisfies the K-L integral
equation for the random process W(t).
(b) Using part (a), show that the same set of orthonormal functions may be used
for the K-L expansion of X(t) and S(t).
(c) Show that for $X_n = S_n + W_n$, the LMMSE estimate of $S_n$ is given as

$$\hat{E}[S_n | X_n] = \frac{\sigma_{S_n}^2}{\sigma_{S_n}^2 + \sigma_W^2} X_n,$$

where $X_n$, $S_n$, and $W_n$ are the K-L coefficients of the respective random
processes.
(d) Using the above, argue that

$$\hat{S}(t) = \sum_{n=1}^{\infty} \frac{\sigma_{S_n}^2}{\sigma_{S_n}^2 + \sigma_W^2} X_n \phi_n(t).$$

[Hint: Expand S(t) in a K-L expansion, $S(t) = \sum_n S_n \phi_n(t)$.]


†Note that $S_n$ and $N_n$ may not be the K-L coefficients of their respective processes.

10.30 A certain continuous-time communications channel can be modeled as signal plus
an independent additive noise

$$Y(t) = S(t) + N(t),$$

where the noise process N(t) is Gaussian distributed with known K-L expansion
$\{\lambda_k, \phi_k(t)\}$, where the $\lambda_k$ are indexed in decreasing order, that is, $\lambda_1 > \lambda_2 >
\lambda_3 > \ldots > 0$. It is decided to structure a “digital” signal as $S(t) = M\phi_k(t)$, for
some choice of k, say $k = k_0$, and the discrete-value message random variable M can
take on any of eight values 0,...,7 equally likely. Repeating this signaling strategy
over successive intervals of length T, we can send a message random sequence M[n]
at the data rate of 3/T bits per second. We keep $k = k_0$ constant here.

(a) The receiver must process the received signal Y(t) to determine the value of
the message random variable M. Argue that a receiver which computes and
bases its decision exclusively on, for all $k \ge 1$,

$$Y_k \triangleq \int_0^T Y(t)\phi_k^*(t)\,dt,$$

can safely ignore all $Y_k$'s except for $Y_{k_0}$. Assume that M is independent of
the noise process N(t).
(b) How should we determine which value of $k_0$ to use? Is there a best value?


*10.31 Derive the Karhunen-Loève expansion for random sequences: If X[n] is a second-order,
zero-mean random sequence with covariance $K_{XX}[n_1, n_2]$, then

$$X[n] = \sum_{k=1}^{N+1} X_k \phi_k[n] \quad \text{with } |n| \le N/2 \text{ and } N \text{ even},$$

$$\text{where} \quad X_k \triangleq \sum_{n=-N/2}^{+N/2} X[n]\phi_k^*[n],$$

$$\text{and} \quad E[X_k X_l^*] = \lambda_k \delta[k - l],$$

$$\text{and} \quad \sum_{n=-N/2}^{+N/2} \phi_k[n]\phi_l^*[n] = \delta[k - l].$$

You may assume Mercer's theorem holds in the form

$$K_{XX}[n_1, n_2] = \sum_{k=1}^{N+1} \lambda_k \phi_k[n_1]\phi_k^*[n_2],$$

which is just the eigenvalue-eigenvector decomposition of the covariance matrix
$\mathbf{K}_{XX}$ with entries $K_{XX}[i, j]$ for $i, j = -N/2, \ldots, +N/2$. (Note: It may be helpful
to rewrite the above in matrix-vector form.)

10.32 Let the zero-mean random process X(t), defined over the interval [0, T], have
covariance function:

$$K_{XX}(t, s) = 3 + 2\cos\left(\frac{2\pi t}{T}\right)\cos\left(\frac{2\pi s}{T}\right) + \cos\left(\frac{4\pi t}{T}\right)\cos\left(\frac{4\pi s}{T}\right).$$

(a) Show that $K_{XX}(t, s)$ is positive semidefinite, and hence, a valid covariance
function.

(b) Find the Karhunen-Loève (K-L) expansion for the random process X(t)
valid over the interval [0, T]. (Hint: Use Mercer's theorem.)

(c) Explain modifications to our general K-L expansion to include an X(t) with
nonzero mean $\mu_X(t)$. How does the K-L integral equation change?

(d) Is the X(t), with covariance given above, mean-square periodic? Explain
your answer.


10.33 Let the zero-mean, second-order random processes $X_1(t)$ and $X_2(t)$ be defined over
the time interval [0, T]. Assume that they have the same Karhunen-Loève basis
functions $\{\phi_n(t)\}$, but different random variable coefficients $X_{1,n}$ and $X_{2,n}$, where
$X_{i,n} \triangleq \int_0^T X_i(t)\phi_n^*(t)\,dt$. Consider now a new quantity, called the ensemble
time-average correlation

$$E\left[\frac{1}{T}\int_0^T X_1(t)X_2^*(t)\,dt\right],$$

and find its equivalent expression in the KLT domain in terms of the KLT coefficient
cross-correlations $E[X_{1,n}X_{2,m}^*]$. Don't worry about existence and convergence
issues here.

10.34 Derive Equation 10.6-6 directly from the definition of the Hilbert-transform system
function,

$$H(\omega) = -j\,\mathrm{sgn}(\omega),$$

where as usual, $j = \sqrt{-1}$ and $\mathrm{sgn}(\omega) = 1$ for $\omega > 0$ and $= -1$ for $\omega < 0$.

*10.35 If a stationary random process is periodic, then we can represent it by a Fourier 
series with orthogonal coefficients. This is not true in general when the random 
process, though stationary, is not periodic. Thus, point out the fallacy in the 
following proposition, which purports to show that the Fourier series coefficients 
are always orthogonal: First take a segment of length T from a stationary random 
process X(t). Repeat the corresponding segment of the correlation function peri- 
odically. This then corresponds to a periodic random process. If we expand this 
process in a Fourier series, its coefficients will be orthogonal. Furthermore, the 
periodic process and the original process will agree over the original time interval. 

10.36 Prove Theorem 10.6-3 by showing that

$$\phi_n(t) = \frac{1}{\sqrt{T}} \exp(j2\pi f_0 n t)$$

are the Karhunen-Loève basis functions for a WSS periodic random process. (Refer
to Section 10.5 and note $f_0 = 1/T$.)

10.37 Let the complex process $Z(t) \triangleq X(t) + jY(t)$, where the real-valued random
processes X(t) and Y(t) are jointly WSS. We now modulate Z(t) to obtain a
(possibly narrowband) process centered at frequency $\omega_0$ as follows:

$$U(t) = \mathrm{Re}\{Z(t)e^{-j\omega_0 t}\}.$$

Given the relevant correlation and cross-correlation functions, that is, $R_{XX}(\tau)$,
$R_{XY}(\tau)$, $R_{YX}(\tau)$, and $R_{YY}(\tau)$, find general conditions on them such that the
modulated process is also WSS. Also find the conditions needed on the mean functions
$\mu_X$ and $\mu_Y$. Show that the resulting process U(t) is actually WSS.

10.38 Show that if a random process is wide-sense periodic with period T, then for any t,
$E[|X(t + T) - X(t)|^2] = 0$, that is, mean-square periodic. Then show, using Chebyshev's
inequality, that also X(t) = X(t + T) with probability-1.

*10.39 Consider the bandpass random process, centered at $\omega_0$,

$$U(t) = X(t)\cos(\omega_0 t) + Y(t)\sin(\omega_0 t),$$

where X(t) and Y(t) are jointly WSS, with correlation functions satisfying the
symmetry conditions of Equations 10.6-4 and 10.6-5. Show that the resulting bandpass
random process U(t) is also WSS and find its correlation function in terms of
the auto- and cross-correlation functions of X(t) and Y(t). Take random processes
X and Y to be zero-mean.†

10.40 Prove an analogous result to Theorem 10.6-3 for WSS periodic random sequences.
(Hint: Perform the expansion for X[n] over $0 \le n \le T - 1$ where T is an integer.
Only use T Fourier coefficients $A_0, \ldots, A_{T-1}$.)

10.41 (mixing condition) Let $\sigma_1 \triangleq \sigma(X[1], X[2], \ldots, X[k])$ be the sigma field of events
generated by the random variables X[1], X[2], ..., X[k], and let
$\sigma_2 \triangleq \sigma(X[k+n], X[k+n+1], \ldots)$, $k \ge 1$ and $n \ge 1$, be the sigma field of events
generated by the random variables X[k+n], X[k+n+1], .... Consider $A \in \sigma_1$ and $B \in \sigma_2$,
and a sequence of numbers $\alpha_n$ such that $\alpha_n \to 0$ as $n \to \infty$. Then the random
sequence {X[n]} is said to satisfy the mixing condition if $|P(AB) - P(A)P(B)| \le \alpha_n$.
Give an example of a stationary random sequence that does not satisfy the
mixing condition.

10.42 A necessary condition for a stationary random sequence to be ergodic is that it
contains no proper sub-ensemble of sample functions that is stationary. By proper
is meant sequences with probability other than zero or one. Give an example of a
stationary random sequence that does not satisfy this condition.

10.43 Consider the weighted mean-square integral of the possibly complex-valued random
process X(t):

74 - a(t) X (t)dt. 
0 


†Some helpful trigonometric identities are $\cos(\alpha \pm \beta) = \cos\alpha\cos\beta \mp \sin\alpha\sin\beta$ and
$\sin(\alpha \pm \beta) = \sin\alpha\cos\beta \pm \cos\alpha\sin\beta$.

We are given the integrable function a(t) and the correlation function $R_X(t_1, t_2)$ of
the random process.

(a) Derive the condition for m.s. existence of this integral using the Cauchy
convergence criterion.

(b) Write an expression for the ensemble average power in the random variable I.
Justify.


10.44 A certain continuous-time communications channel can be modeled as signal plus an
independent additive noise

$$R(t) = S(t) + N(t),$$

where this noise N(t) is Gaussian distributed with known Karhunen-Loève expansion
$\{\lambda_k, \phi_k(t)\}$, where the $\lambda_k$ are indexed in decreasing order $\lambda_1 > \lambda_2 > \ldots > \lambda_k >
\ldots > 0$. It is decided to create a digital signal as $S(t) = M\phi_k(t)$, for some choice
of $k = k_0$, where the discrete message random variable M can take on any of the
eight values 0,...,7 equally likely. Repeating this signaling strategy over successive
intervals of length T, we can thus create a message random sequence M[n] at the
data rate 3/T bits per second.

(a) The receiver must process R(t) for the purpose of determining the value of
the message random variable M. Argue that a receiver which computes and
bases its decision exclusively on, for $k \ge 1$,

$$R_k \triangleq \int_0^T R(t)\phi_k^*(t)\,dt,$$

can safely ignore all but $R_{k_0}$. Assume that M is independent of N(t).
(b) What are the issues in determining which value of $k_0$ to use? Is there a best
value?

10.45 Let X(t) be a random process with constant mean $\mu_X$ and covariance function
$K_X(t, s) = \sigma^2 \cos \omega_0 (t - s)$.

(a) Show that an m.s. derivative process X'(t) exists here.
(b) Find the covariance function of X'(t).


10.46 This problem makes use of the fact from ordinary calculus that two-dimensional
Riemann integrals exist over finite intervals if the integrand is a continuous function.
We here assume that we have a random process X(t) whose correlation function
$R_X(t, s)$ is continuous for all t and s.

(a) Show that such an X(t) is m.s. continuous for all t.
(b) Show that such an X(t) is m.s. integrable over finite intervals.

10.47 Consider the running integral

$$Y(t) \triangleq \int_0^t a(t, \tau)X(\tau)\,d\tau,$$

defined over $t \ge 0$.

(a) Find a condition for the m.s. existence of this integral in terms of the kernel
function $a(t, \tau)$ and the correlation function $R_X(t, s)$.

(b) Assuming it exists, find an expression for the correlation function of Y'(t),
the derivative of the output Y(t), in terms of $a(t, \tau)$ and $R_X(t, s)$.


10.48 Find the generalized m.s. differential equation that, when driven by a standard white
noise W(t), that is, mean 0 and correlation function $\delta(\tau)$, will yield the random
process X(t) with psd

$$S_X(\omega) = \frac{\omega^2 + 1}{(\omega^2 + 4)(\omega^2 + 9)}.$$

Applications to Statistical
Signal Processing


This chapter contains several applications of random sequences and processes in the area of
digital signal processing. We start with the problem of linear estimation and then present
the discrete-time Wiener and Kalman filters. The E-M algorithm is then presented, which
can be used to estimate the model parameters used in these linear filters. A section on hidden
Markov models follows. Estimating power spectral density functions, called spectral estimation,
is presented next for both conventional and high-resolution (parameter-based) models.
Finally we close with a section on simulated annealing, a powerful stochastic optimization
method used for finding global optimal estimates for compound models.


11.1 ESTIMATION OF RANDOM VARIABLES AND VECTORS 


The estimation of a random variable by observing other random variables is a central
problem in signal processing, for example, predicting future weather patterns from current
meteorological data, or predicting global warming from the amount of CO2 in the atmosphere,
or even predicting the amount of oil in an underground reservoir from sounding data.
In this section, we consider this problem as well as predicting one random vector from
observing another. We introduce the basic ideas in this section; subsequent development
and application of these ideas will be taken up in later sections.

To make clear what we mean by estimating one random variable from another, consider
the following example: Let $X_1$ denote the barometric pressure (BP) and $X_2$ denote the
unknown rate of change of the BP at t = 0. Let Y denote the unknown relative humidity
one hour after measuring $\mathbf{X} = (X_1, X_2)^T$. Clearly, in this case, X and Y are dependent
RVs; then using X to estimate Y is a case of estimating one RV from another (actually X
here is a random vector).†

In terms of the axiomatic theory we can describe the problem of estimating one RV
from another in the following terms: Consider an underlying experiment with probability
space $\mathcal{P} \triangleq (\Omega, \mathcal{F}, P)$. Let X and Y be two RVs defined on $\mathcal{P}$. For every $\zeta \in \Omega$, we generate
the numbers $X(\zeta)$, $Y(\zeta)$. Suppose we can observe only $X(\zeta)$; how do we proceed to estimate
$Y(\zeta)$ in some optimum fashion?

At this point the reader may wonder why observing $X(\zeta)$ doesn't uniquely specify $Y(\zeta)$.
After all, since X and Y are deterministic functions, why can't we reason that $X(\zeta)$ specifies
$\zeta$, which in turn specifies $Y(\zeta)$? The answer is that observing $X(\zeta)$ does not, in general, uniquely
specify the outcome $\zeta$ and therefore does not uniquely specify $Y(\zeta)$. For example, let
$\Omega = \{-2, -1, 0, 1, 2\}$, $X(\zeta) \triangleq \zeta^2$, and $Y(\zeta) \triangleq \zeta$. Then the observation $X(\zeta) = 4$ is associated
with the outcomes $\zeta = 2$ or $\zeta = -2$ (of course, these may not be equally probable)
and Y(2) = 2 while Y(−2) = −2. Hence all we can say about $Y(\zeta)$ after observing $X(\zeta)$
is that $Y(\zeta)$ has value 2 or −2. If all outcomes $\zeta \in \Omega$ are equally likely, then the a priori
probability $P[Y = 2] = \frac{1}{5}$ and $P[Y = 2 | X = 4] = \frac{1}{2}$. There may be other reasons why X
cannot uniquely specify Y. For example, Y may have components that are unrelated to X,
as in the signal plus noise problem where Y = X + N and N is noise.
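The two probabilities quoted in this example can be checked by direct enumeration. The following is a minimal sketch, assuming the five outcomes are equally likely as in the text; the names `omega`, `prob`, `consistent`, `p_prior`, and `p_post` are illustrative choices and not notation from the book.

```python
from fractions import Fraction

# Sample space of the example; equal likelihood is the text's own assumption.
omega = [-2, -1, 0, 1, 2]
prob = {z: Fraction(1, len(omega)) for z in omega}

def X(z):
    return z * z   # X(zeta) = zeta^2

def Y(z):
    return z       # Y(zeta) = zeta

# Outcomes that remain possible after observing X = 4
consistent = [z for z in omega if X(z) == 4]

# A priori probability P[Y = 2]
p_prior = sum(prob[z] for z in omega if Y(z) == 2)

# A posteriori probability P[Y = 2 | X = 4]
p_x4 = sum(prob[z] for z in consistent)
p_post = sum(prob[z] for z in consistent if Y(z) == 2) / p_x4

print(consistent, p_prior, p_post)
```

Observing X narrows the possible outcomes to two, so the probability assigned to Y = 2 rises, but Y is still not determined.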

Assume, at first, for simplicity that we are constrained to estimate Y‡ by the linear
function aX. Assume E[X] = E[Y] = 0. Note that a generalization of this problem has
already been treated in Example 4.3-4. The mean-square error (MSE) in estimating Y by
aX is given by

$$\varepsilon^2 \triangleq E[(Y - aX)^2] = \sigma_Y^2 - 2a\,\mathrm{Cov}[X, Y] + a^2\sigma_X^2. \tag{11.1-1}$$

Setting the first derivative with respect to a equal to zero to find the minimum, we obtain

$$a_o = \frac{\mathrm{Cov}[X, Y]}{\sigma_X^2}. \tag{11.1-2}$$

Equation 11.1-2 furnishes the value of a, which yields a minimum mean-square error (MMSE)
in estimating Y with X if we restrict ourselves to linear estimates of the form $\hat{Y} = aX$. We
note in passing that the inner product§

$$(Y - a_o X, X) \triangleq E[(Y - a_o X)X] = 0, \tag{11.1-3}$$

which suggests that the random error $\varepsilon \triangleq Y - a_o X$ is orthogonal to the datum. This interesting
result is sometimes called the orthogonality principle and will shortly be generalized.
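Equations 11.1-1 through 11.1-3 can be verified exactly on a small discrete model. The sketch below assumes a hypothetical zero-mean model, Y = 2X + N with X and N independent, which is not taken from the text; the helper `expect` and all variable names are likewise illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical zero-mean model (an assumption, not from the text): Y = 2X + N
px = {-1: F(1, 2), 1: F(1, 2)}
pn = {-1: F(1, 4), 0: F(1, 2), 1: F(1, 4)}

# Joint pmf of (X, Y), built from the independent pair (X, N)
joint = {}
for (x, p1), (n, p2) in product(px.items(), pn.items()):
    y = 2 * x + n
    joint[(x, y)] = joint.get((x, y), 0) + p1 * p2

def expect(f):
    """Expectation of f(X, Y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

var_x = expect(lambda x, y: x * x)    # sigma_X^2 (means are zero)
var_y = expect(lambda x, y: y * y)    # sigma_Y^2
cov_xy = expect(lambda x, y: x * y)   # Cov[X, Y]

a_o = cov_xy / var_x                                  # Equation 11.1-2
mse = var_y - 2 * a_o * cov_xy + a_o ** 2 * var_x     # Equation 11.1-1 at a = a_o
inner = expect(lambda x, y: (y - a_o * x) * x)        # Equation 11.1-3

def mse_at(a):
    """MSE of the linear estimate aX for an arbitrary coefficient a."""
    return expect(lambda x, y: (y - a * x) ** 2)

print(a_o, mse, inner, mse_at(a_o + 1))
```

For this model the optimal coefficient recovers the factor 2, the residual MSE equals the variance of N, and moving a away from $a_o$ increases the MSE.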


†The abbreviation RV can mean random variable or random vector without ambiguity.

‡All random variables in this section are initially assumed to be real. However, later in the discussion
we shall generalize to include complex RVs.

§For a discussion of inner products involving random variables see Section 5.3.

Let us now remove the constraints of linear estimation and consider the more general
problem of estimating Y with a (possibly nonlinear) function of X, that is, g(X), so as to
minimize the MSE. Thus we seek the function $g_o(X)$, which minimizes

$$\varepsilon^2 = E[(Y - g_o(X))^2].$$

The answer to this problem is surprisingly easy, although its implementation is, except for
the Gaussian case, generally difficult. The result is given in Theorem 11.1-1.

Theorem 11.1-1 The MMSE estimator of Y based on observing the random variable
X is the conditional mean, that is, $g_o(X) \triangleq E[Y|X]$.†

Proof We write g(X) as $g(X) = g_o(X) + \delta g$, that is, as a variation about the assumed
optimal value. Then

$$\begin{aligned} \varepsilon^2 &= E[(Y - E[Y|X] - \delta g)^2] \\ &= E[(Y - E[Y|X])^2] - 2E[(Y - E[Y|X])\delta g] + E[(\delta g)^2]. \end{aligned}$$

Now regarding the cross-term, observe that

$$\begin{aligned} E[(Y - E[Y|X])\delta g] &= E[Y\delta g] - E[E[Y|X]\delta g] \\ &= E[Y\delta g] - E[Y\delta g] \\ &= 0. \end{aligned}$$

We leave it as an exercise to the reader to show how line 2 was obtained from line 1.
With the result just obtained we write

$$\begin{aligned} \varepsilon^2 &= E[(Y - E[Y|X])^2] + E[(\delta g)^2] \\ &\ge E[(Y - E[Y|X])^2] \\ &= \varepsilon_{\min}^2. \end{aligned} \tag{11.1-4}$$

So to make $\varepsilon^2$ a minimum, we set $\delta g = 0$, which implies that the MMSE estimator is
$g_o(X) = E[Y|X]$.
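To see Theorem 11.1-1 at work in a non-Gaussian setting, the conditional mean can be compared with the best linear estimate on a small discrete model. The sketch below assumes a hypothetical model, Y = X³ + N with X uniform on {−2,...,2} and N = ±1 equally likely and independent of X; none of these choices or names comes from the text.

```python
from fractions import Fraction as F

# Hypothetical non-Gaussian model (an assumption): Y = X**3 + N
joint = {}
for x in (-2, -1, 0, 1, 2):
    for n in (-1, 1):
        joint[(x, x ** 3 + n)] = F(1, 5) * F(1, 2)

def expect(f):
    """Expectation of f(X, Y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

# Conditional mean E[Y | X = x], computed directly from the joint pmf
p_x, num = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    num[x] = num.get(x, 0) + p * y
cond_mean = {x: num[x] / p_x[x] for x in p_x}

# MSE of the conditional-mean estimator g_o(X) = E[Y|X]
mse_cond = expect(lambda x, y: (y - cond_mean[x]) ** 2)

# MSE of the best linear estimator a_o X (both means are zero here)
a_o = expect(lambda x, y: x * y) / expect(lambda x, y: x * x)
mse_lin = expect(lambda x, y: (y - a_o * x) ** 2)

print(cond_mean, mse_cond, a_o, mse_lin)
```

Here the conditional mean reduces to x³ and leaves only the noise variance as error, while the best straight-line fit incurs a strictly larger MSE.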

The preceding theorem readily generalizes to random vectors‡ X and Y, where we wish
to minimize the individual component MSE. It follows readily from Theorem 11.1-1 (the
actual demonstration is left as an exercise) that

$$\varepsilon_{\min}^2 = \sum_{i=1}^{N} E[|Y_i - g_o^{(i)}(\mathbf{X})|^2], \tag{11.1-5}$$

where

$$g_o^{(i)}(\mathbf{X}) \triangleq E[Y_i|\mathbf{X}].$$

†The conditional mean estimator is a function of X and hence is itself a random variable. See the
subsection Conditional Expectation as a Random Variable in Section 4.2.

‡In anticipation of the material in later sections, we let X, Y be complex random vectors. A complex
RV X is written $X = X_r + jX_i$, where $X_r$ and $X_i$ are real RVs and represent the real and imaginary
components of X, respectively, and $j = \sqrt{-1}$. The CDF of X is the joint CDF of $X_r$ and $X_i$, that is,
$F_{X_r X_i}(x_r, x_i)$.

Thus, the MMSE estimator of a random vector Y after observing a random vector X is
also the conditional mean that, in vector notation, becomes

$$\mathbf{g}_o(\mathbf{X}) \triangleq E[\mathbf{Y}|\mathbf{X}]. \tag{11.1-6}$$


We can now generalize the orthogonality property observed in Equation 11.1-3. We show
that this is a property of the conditional mean.

Property 11.1-1 The MMSE vector $\boldsymbol{\varepsilon}$ of the vector Y given the random vector X,
that is,

$$\boldsymbol{\varepsilon} \triangleq \mathbf{Y} - E[\mathbf{Y}|\mathbf{X}],$$

is orthogonal to any measurable function h(X) of the data, that is,

$$E[(\mathbf{Y} - E[\mathbf{Y}|\mathbf{X}])h^*(\mathbf{X})] = \mathbf{0} \quad \text{(the asterisk denotes conjugation)}. \tag{11.1-7}$$

Proof Use the same method as in Theorem 11.1-1, which showed that the error was
orthogonal to $\delta g$. Then generalize the result to the vector case. ∎
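Property 11.1-1 can be illustrated numerically for a scalar pair. The sketch below assumes a hypothetical model, Y = X³ + N with ten equally likely (x, n) pairs, and a handful of arbitrarily chosen test functions h; none of this comes from the text. For contrast, it also evaluates the error of the best linear estimate, which is orthogonal to X itself but not to every function of X.

```python
from fractions import Fraction as F

# Hypothetical model (an assumption): Y = X**3 + N, ten equally likely pairs
outcomes = [(x, x ** 3 + n, F(1, 10)) for x in (-2, -1, 0, 1, 2) for n in (-1, 1)]

def cond_mean(x0):
    """E[Y | X = x0], computed from the listed outcomes."""
    rows = [(y, p) for x, y, p in outcomes if x == x0]
    return sum(y * p for y, p in rows) / sum(p for _, p in rows)

# Arbitrary test functions h(X) of the data
tests = {
    "h(x) = 1": lambda x: 1,
    "h(x) = x": lambda x: x,
    "h(x) = x^3": lambda x: x ** 3,
    "h(x) = |x|": lambda x: abs(x),
}

# E[(Y - E[Y|X]) h(X)] for each h; every entry should vanish (Equation 11.1-7)
inner = {
    name: sum(p * (y - cond_mean(x)) * h(x) for x, y, p in outcomes)
    for name, h in tests.items()
}

# Error of the best linear estimate a_o X, tested against X and against X^3
a_o = sum(p * x * y for x, y, p in outcomes) / sum(p * x * x for x, y, p in outcomes)
lin_vs_x = sum(p * (y - a_o * x) * x for x, y, p in outcomes)
lin_vs_x3 = sum(p * (y - a_o * x) * x ** 3 for x, y, p in outcomes)

print(inner, lin_vs_x, lin_vs_x3)
```

The nonzero value of `lin_vs_x3` shows that orthogonality to all functions of the data is what distinguishes the conditional mean from a merely linear fit.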


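Example 11.1-1
(linear optimal estimation) An RV Y is estimated by $\hat{Y} = \sum_{i=1}^{n} a_i X_i$, where $\mu_Y = \mu_{X_i} = 0$
for $i = 1, \ldots, n$. Assume $E[X_i X_j] = 0$, $i \ne j$, and $E[X_i^2] = \sigma_i^2$. Compute the $a_i$, $i = 1, \ldots, n$, that
minimize the MSE. Show that $Y - \sum_{i=1}^{n} a_i X_i$ is orthogonal to $\sum_{i=1}^{n} a_i X_i$. Assume real RVs.

Solution We wish to minimize

$$\begin{aligned} \varepsilon^2 &= E\left[\left(Y - \sum_{i=1}^{n} a_i X_i\right)^2\right] \\ &= \sigma_Y^2 - 2\sum_{i=1}^{n} a_i\,\mathrm{Cov}[X_i, Y] + \sum_{i=1}^{n} a_i^2\sigma_i^2. \end{aligned}$$

By differentiating $\varepsilon^2$ with respect to each of the $a_j$, we obtain

$$\frac{\partial \varepsilon^2}{\partial a_j} = -2\,\mathrm{Cov}[X_j, Y] + 2a_j\sigma_j^2 = 0$$

or

$$a_j = \frac{\mathrm{Cov}[X_j, Y]}{\sigma_j^2} \triangleq a_{jo}.$$

To show the required orthogonality:

$$E\left[\left(Y - \sum_{i=1}^{n} a_{io}X_i\right)\sum_{j=1}^{n} a_{jo}X_j\right] = \sum_{i=1}^{n}\left(\frac{\mathrm{Cov}^2[X_i, Y]}{\sigma_i^2} - \frac{\mathrm{Cov}^2[X_i, Y]}{\sigma_i^4}\sigma_i^2\right) = 0.$$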


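Theorem 11.1-2 Let $\mathbf{X} = (X_1, \ldots, X_N)^T$ and Y be jointly Gaussian distributed
with zero means. The MMSE estimate is the conditional mean given as

$$E[Y|\mathbf{X}] = \sum_{i=1}^{N} a_i X_i,$$

where the $a_i$'s are chosen such that

$$E\left[\left(Y - \sum_{i=1}^{N} a_i X_i\right)X_k^*\right] = 0 \quad \text{for } k = 1, \ldots, N, \tag{11.1-8}$$

which is called the orthogonality condition and is written

$$\left(Y - \sum_{i=1}^{N} a_i X_i\right) \perp X_k, \quad k = 1, \ldots, N.$$

We see that this condition is a special case of Property 11.1-1.

Proof The random variables

$$\left(Y - \sum_{i=1}^{N} a_i X_i\right), X_1, X_2, \ldots, X_N$$

are jointly Gaussian. Hence, since the first one is uncorrelated with all the rest, it is independent
of them. Thus, the error

$$Y - \sum_{i=1}^{N} a_i X_i$$

is independent of the random vector $\mathbf{X} = (X_1, \ldots, X_N)^T$, so

$$\begin{aligned} E\left[\left(Y - \sum_{i=1}^{N} a_i X_i\right)\Big|\mathbf{X}\right] &= E\left[Y - \sum_{i=1}^{N} a_i X_i\right] \\ &= E[Y] - \sum_{i=1}^{N} a_i E[X_i] \\ &= 0, \quad \text{since } E[Y] = E[X_i] = 0. \end{aligned}$$

But

$$\begin{aligned} E\left[\left(Y - \sum_{i=1}^{N} a_i X_i\right)\Big|\mathbf{X}\right] &= E[Y|\mathbf{X}] - \sum_{i=1}^{N} a_i E[X_i|\mathbf{X}] \\ &= E[Y|\mathbf{X}] - \sum_{i=1}^{N} a_i X_i. \end{aligned}$$

Hence

$$E[Y|\mathbf{X}] = \sum_{i=1}^{N} a_i X_i. \quad ∎ \tag{11.1-9}$$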


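Theorem 11.1-2 points out a great simplification of the jointly Gaussian case, that is,
that the conditional mean is linear and readily determined with linear algebraic methods
as the solution to the orthogonality equations, which can be put into symbolic form as

$$(Y - \mathbf{a}^T\mathbf{X}) \perp \mathbf{X}$$

or, in matrix form, as

$$E[(Y - \mathbf{a}^T\mathbf{X})\mathbf{X}^\dagger] = \mathbf{0}^T \quad (\dagger \text{ denotes conjugate transpose}).$$

Hence the optimum value of a, denoted by $\mathbf{a}_o$, is given by

$$\mathbf{a}_o^T = \mathbf{k}_{YX}\mathbf{K}_{XX}^{-1}, \tag{11.1-10}$$

where $\mathbf{k}_{YX} \triangleq E[Y\mathbf{X}^\dagger]$ and $\mathbf{K}_{XX} \triangleq E[\mathbf{X}\mathbf{X}^\dagger]$.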
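Equation 11.1-10 amounts to solving a small linear system. The sketch below does this for two real observations; the numerical values of `K_xx` and `k_yx` are hypothetical (they would arise, for instance, from $Y = 2X_1 - X_2 + N$ with N orthogonal to X), and the helper functions are illustrative rather than taken from the text.

```python
from fractions import Fraction as F

# Hypothetical second-order statistics (assumed values, real-valued case)
K_xx = [[F(2), F(1)], [F(1), F(2)]]   # K_XX = E[X X^T]
k_yx = [F(3), F(0)]                   # k_YX = E[Y X^T]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix by the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("K_XX must be nonsingular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def row_times_matrix(row, m):
    """Product of a length-2 row vector with a 2x2 matrix."""
    return [sum(row[i] * m[i][j] for i in range(2)) for j in range(2)]

# Equation 11.1-10: a_o^T = k_YX K_XX^{-1}
a_o = row_times_matrix(k_yx, inverse_2x2(K_xx))

# Orthogonality check: E[(Y - a_o^T X) X^T] = k_YX - a_o^T K_XX should be zero
residual = [k - v for k, v in zip(k_yx, row_times_matrix(a_o, K_xx))]

print(a_o, residual)
```

The solution recovers the coefficients 2 and −1 of the assumed model, and the zero residual is the orthogonality condition restated in matrix form.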

If the means are not zero, the answer is slightly more complicated. If X and Y are
jointly Gaussian with means $\boldsymbol{\mu}_X$ and $\mu_Y$, respectively, we can define the zero-mean random
variables

$$\mathbf{X}_c \triangleq \mathbf{X} - \boldsymbol{\mu}_X, \qquad Y_c \triangleq Y - \mu_Y.$$

Then Theorem 11.1-2 applies to them directly, and we can write

$$E[Y_c|\mathbf{X}_c] = \sum_{i=1}^{N} a_i X_{ci} = E[(Y - \mu_Y)|\mathbf{X}_c]. \tag{11.1-11}$$

But the conditional expectation is a linear operation so that

$$E[(Y - \mu_Y)|\mathbf{X}_c] = E[Y|\mathbf{X}_c] - \mu_Y. \tag{11.1-12}$$

Let us observe next that

$$E[Y|\mathbf{X}_c] = E[Y|\mathbf{X}], \tag{11.1-13}$$

a result that is intuitively agreeable since we do not expect the conditional expectation
of Y to depend on whether the average value of X is included in the conditioning or not.
A formal demonstration of Equation 11.1-13 is readily obtained by writing out the definition
of conditional expectation using pdf's and considering the transformation $\mathbf{X}_c = \mathbf{X} - \boldsymbol{\mu}_X$;
this is left as an exercise. Now using Equation 11.1-13 in Equation 11.1-12 and plugging
this into Equation 11.1-11, we obtain our final result as

$$\begin{aligned} E[Y|\mathbf{X}] &= \sum_{i=1}^{N} a_i X_{ci} + \mu_Y \\ &= \sum_{i=1}^{N} a_i (X_i - \mu_{X_i}) + \mu_Y, \end{aligned} \tag{11.1-14}$$

which is the general expression for the jointly Gaussian case when the means are nonzero.
We see that the estimate is in the form of a linear transformation plus a bias. In passing, we
note that the $a_i$'s would be determined from Equation 11.1-10 using the correlation matrix
and cross-correlation vector of the zero-mean random variables, that is, the covariance matrices
of the original random variables X and Y.
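The "linear transformation plus a bias" form of Equation 11.1-14 is simple to evaluate once the coefficients are known. The sketch below assumes a scalar observation with hypothetical means and covariances; the function name `conditional_mean` and all numbers are illustrative and not from the text.

```python
def conditional_mean(x, mu_x, mu_y, a):
    """Evaluate Equation 11.1-14: sum_i a_i (x_i - mu_Xi) + mu_Y."""
    return sum(ai * (xi - mi) for ai, xi, mi in zip(a, x, mu_x)) + mu_y

# Hypothetical scalar case (assumed values): one observation X, one unknown Y
mu_x, mu_y = [10.0], 3.0
var_x, cov_xy = 4.0, 2.0

# Coefficient from Equation 11.1-10 applied to the centered variables,
# that is, from the covariances of the original variables
a = [cov_xy / var_x]

estimate = conditional_mean([12.0], mu_x, mu_y, a)    # observation above its mean
at_mean = conditional_mean([10.0], mu_x, mu_y, a)     # observation equal to its mean

print(a, estimate, at_mean)
```

When the observation equals its mean, the estimate collapses to $\mu_Y$, which is the bias term; deviations of X from its mean shift the estimate in proportion to the coefficient.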


More on the Conditional Mean 


Above we introduced the concept of the minimum mean-square error (MMSE) estimator 
of one random vector from another. Theorem 11.1-1 showed that this optimum estimator 
is equal to the conditional mean. Equivalently the MMSE estimator of the random vector 
X based on observation of the random vector Y is E[X|Y] which corresponds to Equation 
11.1-6, but with the roles of X and Y interchanged. Here X represents the transmitted 
signal and Y represents the observed signal, which is a corrupted version of X. In using 
the symbols this way, we follow the notation often used in the signal processing literature. 
The corruption of X can come through added external noise, or through other means such 
as random channel behavior. To extend the (scalar) conditional-mean result to (possibly 
complex) random vectors, we define the total MMSE by 


$$\begin{aligned} \varepsilon_{\min}^2 &= E[\|\mathbf{X} - \mathbf{g}(\mathbf{Y})\|^2] \\ &= \sum_{i=1}^{N} E[|X_i - g_i(\mathbf{Y})|^2], \end{aligned} \tag{11.1-15}$$

and must now show that $g_i(\mathbf{Y}) = E[X_i|\mathbf{Y}]$ minimizes each term in this sum of square
values. Thus, we need a version of Theorem 11.1-1 for complex random variables. This can
be established by first modifying the MSE in the proof of Theorem 11.1-1 to

$$\begin{aligned} \varepsilon^2 &= E[|X - E[X|\mathbf{Y}] - \delta g(\mathbf{Y})|^2] \\ &= E[|X - E[X|\mathbf{Y}]|^2] - 2\,\mathrm{Re}\,E[(X - E[X|\mathbf{Y}])\delta g^*] + E[|\delta g|^2], \end{aligned}$$

and then proceeding to show that the cross-term is zero, using the smoothing property of
conditional expectation, cf. Equation 4.2-27, thus obtaining

$$\varepsilon^2 = E[|X - E[X|\mathbf{Y}]|^2] + E[|\delta g(\mathbf{Y})|^2]. \tag{11.1-16}$$

From this equation, we conclude that $\delta g$ is zero just as before. Thus, $g_i(\mathbf{Y}) = E[X_i|\mathbf{Y}]$ will
minimize each term in Equation 11.1-15, thus establishing the conditional mean as the MSE
optimal estimate for complex random vectors.

Assume that we observe a complex random sequence Y[n] over $n \ge 0$, and that we
wish to estimate X[n] based on the present and past of Y[n]. Based on the preceding vector
results, we have immediately upon definition of the random vector,

$$\mathbf{Y}_n \triangleq [Y[n], Y[n-1], \ldots, Y[0]]^T,$$

that the MMSE estimate of X[n], denoted $\hat{X}[n]$, is

$$\hat{X}[n] = E[X[n]|\mathbf{Y}_n]$$

at each n. Such an estimate is called causal or sometimes a filter estimate since it does not
involve the input sequence Y at future times m > n. For infinite-length observations Y, one
can define the estimate of X[n] based on the vector $\mathbf{Y}_n^{(N)} \triangleq [Y[n], Y[n-1], \ldots, Y[n-N]]^T$
and then let N go to infinity to define the mean-square limit under appropriate convergence
conditions,

$$\lim_{N \to \infty} E[X[n]|\mathbf{Y}_n^{(N)}], \tag{11.1-17}$$

thereby defining the expectation conditioned on infinite-length sequences. One can show that
this limit exists with probability-1 by using Martingale theory (cf. Section 8.8). We can
show (see Problem 8.55) that for each fixed n, the random sequence with time parameter N,

$$G[N] \triangleq E[X[n]|\mathbf{Y}_n^{(N)}], \tag{11.1-18}$$

is a Martingale. We thus can use the Martingale Convergence Theorem 8.8-4 to conclude
the probability-1 existence of the limit (Equation 11.1-17) if the variance of the random
sequence G[N] is uniformly bounded in the sense of the theorem; that is, for all $N \ge 1$,

$$\sigma_G^2[N] \le C < \infty \quad \text{for some finite } C. \tag{11.1-19}$$

In fact, this variance is bounded by the variance of X[n]. We leave the demonstration of
this result to the reader (Problem 11.4).

Similar expressions can be obtained for random processes under appropriate continuity 
conditions. For instance, by conditioning on ever more dense samplings of a random process 
Y (t) over ever larger intervals of the past, one can define the conditional mean of the random 
process X(t) based on causal observation of the random process Y(t), 


$$E[X(t)\,|\,Y(\tau), \tau \le t]. \tag{11.1-20}$$

We also learned that when the observations Y are Gaussian and zero-mean, the conditional
mean is linear in these conditioning random variables (compare to Theorem 11.1-2). For
the case of estimating the random vector X from the random vector Y, this becomes

$$E[\mathbf{X}|\mathbf{Y}] = \mathbf{A}\mathbf{Y}, \tag{11.1-21}$$

where the coefficients $A_{ij}$ are determined by the orthogonality conditions

$$(\mathbf{X} - \mathbf{A}\mathbf{Y})_i \perp Y_k, \quad 1 \le i, k \le N, \tag{11.1-22}$$

which is just the orthogonality condition of Theorem 11.1-2 applied to the estimation of
each component $X_i$ of X. The proof of this theorem follows the same reasoning as before
with the proper definition of the complex Gaussian random vector. However, the definition
is somewhat restrictive and the development would take us too far afield. The interested
reader may consult [11-1] for the complex Gaussian theory which is often applied to model
narrowband data, such as in Section 10.6.


Orthogonality and Linear Estimation 


We have seen that the MMSE estimate is given by the conditional mean which is linear in 
the observations when the data is jointly Gaussian. Unfortunately, in the non-Gaussian case 
the conditional mean estimate can be nonlinear and it is often very difficult to obtain. In 
general, the derivation of an optimal nonlinear estimate will depend on higher-order moment 
functions that may not be available. For these reasons, in this section we concentrate on the 
best linear estimate for minimizing the MSE. We denote this estimate as LMMSE, standing 
for linear minimum mean-square error. Of course, for Gaussian data the LMMSE and the 
MMSE estimators are the same. Sometimes we will use the phrase optimal linear to describe 
the LMMSE estimate or estimator. 

Consider the random sequence Y[n] observed for $n \ge 0$. We wish to linearly estimate
the random signal sequence X[n]. We assume both sequences are zero-mean to simplify the
discussion, since the reader should be able to extend the argument to the nonzero-mean
case with no difficulty. We denote the LMMSE estimate by

$$\hat{E}[X[n] \mid Y[n], \ldots, Y[0]], \tag{11.1-23}$$

where the hat on $\hat{E}$ distinguishes this linear estimate from the nonlinear conditional mean
estimate $E[X[n] \mid Y[n], \ldots, Y[0]]$. For the moment, we will treat Equation 11.1-23 as just
a notation for the LMMSE estimate, but at the end of the section we will introduce the
$\hat{E}$ operator. The following theorem establishes that the LMMSE estimate for a complex
random sequence is determined by the orthogonality principle.


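Theorem 11.1-3 The LMMSE estimate of the zero-mean random sequence X[n]
based on the zero-mean random sequence Y[n] is given as

$$\hat{E}[X[n] \mid Y[n], \ldots, Y[0]] = \sum_{i=0}^{n} a_i^{(n)} Y[i],$$

where the $a_i^{(n)}$'s satisfy the orthogonality condition,

$$\left[ X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right] \perp Y[k], \qquad 0 \le k \le n.$$

Furthermore, the LMMSE is given by

$$\varepsilon_{\min}^2 = E[|X[n]|^2] - \sum_{i=0}^{n} a_i^{(n)} E[Y[i] X^*[n]]. \tag{11.1-24}$$

Proof Let the $a_i^{(n)}$'s be the coefficients determined by the orthogonality principle and
let $b_i^{(n)}$ be some other set of coefficients. Then we can write the error using this other set of
coefficients as

$$X[n] - \sum_{i=0}^{n} b_i^{(n)} Y[i] = \left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) + \sum_{i=0}^{n} \left( a_i^{(n)} - b_i^{(n)} \right) Y[i],$$

where we have both added and subtracted $\sum_{i=0}^{n} a_i^{(n)} Y[i]$. Because the first term on the right of
the equal sign is orthogonal to Y[i] for i = 0, ..., n, we have

$$\left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) \perp \sum_{i=0}^{n} \left( a_i^{(n)} - b_i^{(n)} \right) Y[i],$$

which implies

$$\begin{aligned}
E\left[ \left| X[n] - \sum_{i=0}^{n} b_i^{(n)} Y[i] \right|^2 \right]
&= E\left[ \left| X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right|^2 \right]
 + E\left[ \left| \sum_{i=0}^{n} \left( a_i^{(n)} - b_i^{(n)} \right) Y[i] \right|^2 \right] \\
&\ge E\left[ \left| X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right|^2 \right],
\end{aligned}$$

with equality if and only if $a_i^{(n)} = b_i^{(n)}$ for i = 0, ..., n. To evaluate the MSE, we compute

$$\begin{aligned}
\varepsilon_{\min}^2 &= E\left[ \left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) \left( X[n] - \sum_{k=0}^{n} a_k^{(n)} Y[k] \right)^* \right] \\
&= E\left[ \left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) X^*[n] \right]
 - \sum_{k=0}^{n} a_k^{(n)*} E\left[ \left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) Y^*[k] \right].
\end{aligned}$$

We note also that

$$\left( X[n] - \sum_{i=0}^{n} a_i^{(n)} Y[i] \right) \perp Y[k], \qquad 0 \le k \le n,$$

and thus by this orthogonality we have

$$\varepsilon_{\min}^2 = E[|X[n]|^2] - \sum_{i=0}^{n} a_i^{(n)} E[Y[i] X^*[n]]. \qquad \blacksquare$$

Proceeding to solve for the $a_i^{(n)}$, we suppress the superscript (n) for notational simplicity and
write out the orthogonality condition of the preceding theorem: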


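$$E[X[n] Y^*[k]] = \sum_{i=0}^{n} a_i E[Y[i] Y^*[k]], \qquad 0 \le k \le n. \tag{11.1-25}$$

We define the column vector,

$$\mathbf{a} \triangleq [a_0, a_1, \ldots, a_n]^T,$$

and row vector,

$$\mathbf{k}_{XY} \triangleq E\big[ X[n]Y^*[0],\ X[n]Y^*[1],\ \ldots,\ X[n]Y^*[n] \big] = \big[ K_{XY}[n,0], \ldots, K_{XY}[n,n] \big],$$

and covariance matrix

$$\mathbf{K}_{YY} \triangleq E[\mathbf{Y}\mathbf{Y}^\dagger], \quad \text{with random column vector } \mathbf{Y} \triangleq [Y[0], \ldots, Y[n]]^T,$$

where

$$(\mathbf{K}_{YY})_{ij} = E[Y[i]Y^*[j]] = K_{YY}[i,j].$$

Then Equation 11.1-25 becomes

$$\mathbf{k}_{XY} = \mathbf{a}^T \mathbf{K}_{YY},$$

with solution

$$\mathbf{a}^T = \mathbf{k}_{XY} \mathbf{K}_{YY}^{-1}. \tag{11.1-26}$$

The MSE of Equation 11.1-24 then becomes

$$\varepsilon_{\min}^2 = \sigma_X^2[n] - \mathbf{k}_{XY} \mathbf{K}_{YY}^{-1} \mathbf{k}_{XY}^\dagger. \tag{11.1-27}$$

One comment on the MSE expression is that the largest MSE an LMMSE estimator can
produce is $\sigma_X^2[n]$, which happens when the $a_i^{(n)}$'s all equal zero. The latter is
optimal when there is no cross-covariance between the observations and the signal X[n].
Any nonzero cross-covariance causes the MSE to decrease from $\sigma_X^2[n]$ down to the value
given in Equation 11.1-27, the amount of decrease given by

$$\mathbf{k}_{XY} \mathbf{K}_{YY}^{-1} \mathbf{k}_{XY}^\dagger.$$

We next look at an example of the above linear estimation procedure applied to the problem
of estimating a signal contaminated by white noise.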


Example 11.1-2
(estimation of a signal in noise) Assume we have a random signal sequence X[n], which is
immersed in white noise V[n] of variance $\sigma_V^2$, where the signal and noise are uncorrelated
and zero-mean. Let the observations be for $n \ge 0$,

$$Y[n] = X[n] + V[n], \qquad X \perp V.$$

We want to determine the causal LMMSE estimate $\hat{X}[n]$,

$$\hat{X}[n] = \hat{E}[X[n] \mid Y[n], \ldots, Y[0]] = \sum_{i=0}^{n} a_i^{(n)} Y[i],$$

in terms of the covariance function of the signal X and the variance of the white noise V.
From Theorem 11.1-3 the coefficients of the optimal linear estimator $a_i^{(n)}$ are determined by
the orthogonality conditions (Equation 11.1-25) specialized to this example. The solution
is thus Equation 11.1-26, which must be solved for each value of $n \ge 0$. It remains to
determine the covariance matrix $\mathbf{K}_{YY}$ and cross-covariance vector $\mathbf{k}_{XY}$. Looking at the
ijth component of $\mathbf{K}_{YY}$ we compute

$$\begin{aligned}
(\mathbf{K}_{YY})_{ij} &= E[Y[i]Y^*[j]] \\
&= E[X[i]X^*[j]] + E[V[i]V^*[j]] \\
&= (\mathbf{K}_{XX})_{ij} + \sigma_V^2 \delta_{ij}.
\end{aligned}$$

For the $\mathbf{k}_{XY}$ we obtain

$$\begin{aligned}
(\mathbf{k}_{XY})_i &= E[X[n]Y^*[i]] = E[X[n](X[i] + V[i])^*] \\
&= E[X[n]X^*[i]] \triangleq (\mathbf{k}_{XX})_i, \qquad 0 \le i \le n,
\end{aligned}$$

since the signal is orthogonal to the noise. Thus, we obtain the estimator coefficient vector

$$\mathbf{a}^T = \mathbf{k}_{XX} (\mathbf{K}_{XX} + \sigma_V^2 \mathbf{I})^{-1}.$$

A special case that allows considerable simplification of this example is when the signal
covariance is diagonal; that is, X[n] is also a white noise,

$$\mathbf{K}_{XX} = \sigma_X^2 \mathbf{I} \quad \text{and} \quad \mathbf{k}_{XX} = \sigma_X^2 [0, 0, \ldots, 0, 1].$$

Then the solution for the coefficient vector is

$$\mathbf{a}^T = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_V^2} [0, 0, \ldots, 0, 1],$$

which means that $a_i^{(n)} = 0$ except for i = n, so that the LMMSE estimate is

$$\hat{X}[n] = \left[ \sigma_X^2 / (\sigma_X^2 + \sigma_V^2) \right] Y[n].$$

Actually this special case arose when we estimated the coefficients of a Karhunen-Loève
expansion of a signal process from the corresponding expansion coefficients of the random
signal process with additive white noise in Example 10.5-2.

In general, the signal random sequence is not white and therefore the $a_i^{(n)}$ are not zero
and the estimate takes the growing-memory form:

$$\hat{X}[n] = \sum_{i=0}^{n} a_i^{(n)} Y[i].$$

It is called growing memory because all past data must be used in the current estimate.
The MSE is given by the formula

$$E[|X[n] - \hat{X}[n]|^2] = \sigma_X^2[n] - \sum_{i=0}^{n} a_i^{(n)} K_{YX}[i,n].$$
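The signal-in-white-noise estimator of Example 11.1-2 can be sketched numerically as follows. This is only an illustration under stated assumptions: the data are real-valued, the helper name `lmmse_coefficients` is hypothetical, and the exponential signal covariance with `rho = 0.8` is made up for the demonstration rather than taken from the text.

```python
import numpy as np

def lmmse_coefficients(K_xx, sigma_v2):
    """LMMSE weights for X[n] given Y[0..n] = X + white noise (real-valued case).

    K_xx: (n+1) x (n+1) signal covariance matrix, indexed by time 0..n.
    sigma_v2: variance of the white observation noise (uncorrelated with X).
    """
    K_xx = np.asarray(K_xx, dtype=float)
    k_xx = K_xx[-1, :]                       # (k_XX)_i = E[X[n] X[i]]
    K_yy = K_xx + sigma_v2 * np.eye(len(K_xx))
    a = np.linalg.solve(K_yy, k_xx)          # K_YY symmetric, so a = K_YY^{-1} k_XX
    mse = K_xx[-1, -1] - a @ k_xx            # sigma_X^2[n] - sum_i a_i K_YX[i, n]
    return a, mse

# Hypothetical correlated signal: K_XX[i, j] = sigma_x2 * rho**|i - j|
n, sigma_x2, rho, sigma_v2 = 4, 1.0, 0.8, 0.5
idx = np.arange(n + 1)
K_corr = sigma_x2 * rho ** np.abs(idx[:, None] - idx[None, :])
a_corr, mse_corr = lmmse_coefficients(K_corr, sigma_v2)

# White-signal special case: only the current sample Y[n] gets a nonzero weight.
a_white, mse_white = lmmse_coefficients(sigma_x2 * np.eye(n + 1), sigma_v2)

print("correlated signal:", a_corr, mse_corr)
print("white signal:     ", a_white, mse_white)
```

In the white-signal case the weight vector collapses to a single entry on Y[n], while the correlated signal spreads weight over all past samples, which is the growing-memory behavior described above.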
We next turn to the application of these results to the prediction problem for random 
sequences. 


Example 11.1-3
(Normal equations for linear prediction) Let X(t) be a real-valued zero-mean random
process. Assume that we wish to predict $X(t_{n+1})$ from a linear combination of n previous
observations $X(t_n), X(t_{n-1}), X(t_{n-2}), \ldots, X(t_1)$, where $t_{n+1} > t_n > t_{n-1} > \ldots > t_1$.
Denote the predicted value by $\hat{X}(t_{n+1})$. Then linear prediction implies that
$\hat{X}(t_{n+1}) = \sum_{i=1}^{n} c_i X(t_i)$ and that the prediction error is $\varepsilon = X(t_{n+1}) - \hat{X}(t_{n+1})$. To find the LMMSE
we adjust the coefficients $\{c_i\}$ so that the mean-square error
$E[\varepsilon^2] = E\left[ \left| X(t_{n+1}) - \sum_{i=1}^{n} c_i X(t_i) \right|^2 \right]$ is minimized. The minimum can be computed from

$$\partial E[\varepsilon^2] / \partial c_j = 0, \quad \text{for } j = 1, 2, \ldots, n.$$

Indeed, carrying out the differentiation furnishes a specialization of the orthogonality
principle, namely,

$$E[\varepsilon X(t_j)] = 0 \quad \text{for } j = 1, 2, \ldots, n.$$

So, as we already know, the data must be orthogonal to the error. The optimum coefficients
satisfy

$$R_{XX}(t_{n+1}, t_j) - \sum_{i=1}^{n} c_i R_{XX}(t_i, t_j) = 0 \quad \text{for } j = 1, 2, \ldots, n.$$

In the WSS case, these equations take the form

$$R_{XX}(t_{n+1} - t_j) - \sum_{i=1}^{n} c_i R_{XX}(t_i - t_j) = 0.$$

Usually the sampling intervals are equally spaced so that $t_{n+1} - t_j = (n + 1 - j)\Delta t$. Letting
$\Delta t = 1$ and $l = n + 1 - j$, we obtain the Normal equations

$$R_{XX}(l) - \sum_{i=1}^{n} c_i' R_{XX}(l - i) = 0, \qquad l = 1, 2, \ldots, n,$$

where the $c_i'$ are in inverse order from the $c_i$, that is, $c_i' = c_{n+1-i}$. The solution to these
equations can be obtained efficiently by the Levinson-Durbin algorithm, which exploits
the special symmetry properties of the correlation or covariance matrix. The algorithm is
discussed in advanced texts on signal processing, for example [11-2].
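A minimal sketch of the Levinson-Durbin recursion for the Normal equations of Example 11.1-3 is given below and checked against a direct solve of the same Toeplitz system. It follows the standard textbook form of the recursion, not necessarily the presentation in [11-2]; the function name `levinson_durbin` and the autocorrelation values are assumptions chosen for illustration, and real-valued data are assumed.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve R[l] = sum_{i=1}^{p} c_i R[l - i], l = 1..p, for a real WSS sequence.

    r: autocorrelation values R[0], R[1], ..., R[order].
    Returns the predictor coefficients c_1..c_p and the final prediction MSE.
    """
    r = np.asarray(r, dtype=float)
    c = np.zeros(order + 1)          # c[0] is unused; c[i] multiplies X[n - i]
    err = r[0]                       # MSE of the order-0 predictor
    for m in range(1, order + 1):
        # Reflection coefficient from the part of R[m] not yet explained
        k = (r[m] - c[1:m] @ r[m - 1:0:-1]) / err
        updated = c.copy()
        updated[1:m] = c[1:m] - k * c[m - 1:0:-1]
        updated[m] = k
        c = updated
        err *= 1.0 - k * k
    return c[1:], err

# Hypothetical autocorrelation (a sum of two exponentials), order-3 predictor
p = 3
lags = np.arange(p + 1)
r = 0.9 ** lags + 0.5 * 0.3 ** lags

c_fast, mse_fast = levinson_durbin(r, p)

# Direct solution of the same Toeplitz system for comparison
T = r[np.abs(lags[:p, None] - lags[None, :p])]
c_direct = np.linalg.solve(T, r[1:])
mse_direct = r[0] - c_direct @ r[1:]
```

The recursion needs on the order of p^2 operations, versus p^3 for a general linear solve, which is the efficiency gain that the Toeplitz structure buys. Here `c_fast[i-1]` plays the role of the most-recent-first coefficient $c_i'$.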


Example 11.1-4
(linear prediction for a first-order WSS Markov process) For the real-valued, first-order
Markov process consider an increasing sequence of times $t_1 < t_2 < \ldots < t_n$ for some
positive n. Then
$f_X(x_{n+1} \mid x_n, x_{n-1}, \ldots, x_1; t_{n+1}, t_n, t_{n-1}, \ldots, t_1) = f_X(x_{n+1} \mid x_n; t_{n+1}, t_n)$
and, hence, the predictability of $X(t_{n+1})$ depends only on observing $X(t_n)$. Thus, in this
case, $\hat{X}(t_{n+1}) = cX(t_n)$, with prediction error $\varepsilon = X(t_{n+1}) - cX(t_n)$. The Normal equations
for this case are

$$R_{XX}(t_{n+1} - t_j) = c R_{XX}(t_n - t_j), \qquad j = 1, 2, \ldots, n.$$

With $\tau_i \triangleq t_{i+1} - t_i > 0$, i = 1, 2, ..., n, we can rewrite the Normal equations in a more
revealing form as

$$\begin{aligned}
R_{XX}(\tau_n) &= c R_{XX}(0), & j &= n \\
R_{XX}(\tau_n + \tau_{n-1}) &= c R_{XX}(\tau_{n-1}), & j &= n-1 \\
R_{XX}(\tau_n + \tau_{n-1} + \tau_{n-2}) &= c R_{XX}(\tau_{n-1} + \tau_{n-2}), & j &= n-2 \\
&\ \ \vdots
\end{aligned}$$

From these equations we can establish a form for the correlation function of a zero-mean,
WSS, first-order Markov process. For example, if we divide the first equation by the second,
we readily obtain

$$R_{XX}(\tau_n) R_{XX}(\tau_{n-1}) = R_{XX}(0) R_{XX}(\tau_n + \tau_{n-1}),$$

which is satisfied by the form $R_{XX}(\tau) = b \exp[\pm a\tau]$. Now using the general results that
$R_{XX}(0) \ge |R_{XX}(\tau)|$ and $R_{XX}(\tau) = R_{XX}(-\tau)$, we obtain that

$$R_{XX}(\tau) = b \exp(-a|\tau|), \qquad -\infty < \tau < \infty,\ a > 0,\ b > 0,$$

which implies a first-order process.
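As a quick consistency check on this result, one can substitute the exponential form back into the Normal equations of Example 11.1-4. The short derivation below assumes $R_{XX}(\tau) = b\exp(-a|\tau|)$ with positive spacings $\tau_i$ and shows that a single coefficient c satisfies both the j = n and the j = n - 1 equations.

```latex
\begin{align*}
c &= \frac{R_{XX}(\tau_n)}{R_{XX}(0)} = \frac{b\,e^{-a\tau_n}}{b} = e^{-a\tau_n}, \\
R_{XX}(\tau_n + \tau_{n-1}) &= b\,e^{-a(\tau_n + \tau_{n-1})}
  = e^{-a\tau_n}\, b\,e^{-a\tau_{n-1}} = c\,R_{XX}(\tau_{n-1}).
\end{align*}
```

The same factorization works for every earlier j, so the predictor $\hat{X}(t_{n+1}) = e^{-a\tau_n} X(t_n)$ is orthogonal to all of the past samples, not just the most recent one.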


Example 11.1-5 
(optimum linear interpolation) In Example 8.4-7 we constructed a sequence in which every 
other sample had value zero. The sequence so constructed retained all of the original samples, 
but introduced zeros between each sample, thus making the sequence twice as long. In 
practice, interpolated values replace the zeros to eliminate expansion artifacts. There are a 
number of ways to construct these interpolated samples. One is to pass the sequence through 
an appropriate low pass filter whose impulse response is the interpolating function. Another 
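is to create the interpolated sample $\hat{X}_s[n]$ by taking the sample mean of the adjacent samples
on either side, that is, $\hat{X}_s[n] = \tfrac{1}{2}X[n-1] + \tfrac{1}{2}X[n+1]$.

A third is to apply optimum linear interpolation. We compare the last two methods
here. In the case of optimum linear interpolation the orthogonality principle requires that

$$\begin{aligned}
E[(\hat{X}[n] - X[n]) X[n-1]] &= 0 \\
E[(\hat{X}[n] - X[n]) X[n+1]] &= 0,
\end{aligned}$$

where

$$\hat{X}[n] = aX[n-1] + bX[n+1]$$

and the coefficients a, b must be determined so that the orthogonality conditions are
satisfied. We denote these (optimum) coefficients by $a_o, b_o$. From the orthogonality equations,
in the real-valued case, we obtain that

$$\begin{bmatrix} R_{XX}[0] & R_{XX}[2] \\ R_{XX}[2] & R_{XX}[0] \end{bmatrix}
\begin{bmatrix} a_o \\ b_o \end{bmatrix}
= \begin{bmatrix} R_{XX}[1] \\ R_{XX}[1] \end{bmatrix},$$

which yields the solution

$$\begin{bmatrix} a_o \\ b_o \end{bmatrix}
= \begin{bmatrix} R_{XX}[0] & R_{XX}[2] \\ R_{XX}[2] & R_{XX}[0] \end{bmatrix}^{-1}
\begin{bmatrix} R_{XX}[1] \\ R_{XX}[1] \end{bmatrix},$$

or $a_o = b_o = R_{XX}[1] / (R_{XX}[0] + R_{XX}[2])$. For any estimator the estimation error is
$\mathcal{E}[n] = X[n] - \hat{X}[n]$ and the mean-square error $\varepsilon^2$ can be written as $\varepsilon^2 = E[\mathcal{E}^2[n]]$, where $\varepsilon^2$ doesn't
depend on n in the WSS case. After some simple algebraic manipulations we obtain, for
this real-valued case,

$$\varepsilon_o^2 = R_{XX}[0] - \frac{2R_{XX}^2[1]}{R_{XX}[0] + R_{XX}[2]}$$

for the optimum interpolator, and

$$\varepsilon_s^2 = R_{XX}[0] - \left( 2R_{XX}[1] - \frac{R_{XX}[0] + R_{XX}[2]}{2} \right)$$

for the sample mean interpolator. To show that $\varepsilon_o^2 \le \varepsilon_s^2$ we must show that

$$\frac{2R_{XX}^2[1]}{R_{XX}[0] + R_{XX}[2]} \ge 2R_{XX}[1] - \frac{R_{XX}[0] + R_{XX}[2]}{2}.$$

But this result is equivalent to writing

$$\left( R_{XX}[0] + R_{XX}[2] - 2R_{XX}[1] \right)^2 \ge 0,$$

which is always true.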
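The two MSE formulas of Example 11.1-5 can be cross-checked by expanding the mean-square error directly for arbitrary weights. The sketch below does this for a hypothetical autocorrelation $R_{XX}[m] = 0.9^{|m|}$; the function names and the numerical values are assumptions introduced only for the check.

```python
import numpy as np

def interpolation_mse(R0, R1, R2):
    """Return (a_o, mse_opt, mse_mean) for a real WSS sequence."""
    a_o = R1 / (R0 + R2)                          # optimum weight, a_o = b_o
    mse_opt = R0 - 2.0 * R1**2 / (R0 + R2)        # optimum interpolator
    mse_mean = R0 - (2.0 * R1 - 0.5 * (R0 + R2))  # sample-mean interpolator
    return a_o, mse_opt, mse_mean

def mse_of_weights(a, b, R0, R1, R2):
    """E[(X[n] - a X[n-1] - b X[n+1])^2], expanded term by term."""
    return R0 * (1 + a * a + b * b) + 2 * a * b * R2 - 2 * (a + b) * R1

# Hypothetical autocorrelation R[m] = 0.9**|m|
R0, R1, R2 = 1.0, 0.9, 0.81
a_o, mse_opt, mse_mean = interpolation_mse(R0, R1, R2)

print(a_o, mse_opt, mse_mean)
print(mse_of_weights(a_o, a_o, R0, R1, R2), mse_of_weights(0.5, 0.5, R0, R1, R2))
```

For a strongly correlated sequence such as this one the optimum weight is close to 1/2, so the two interpolators perform almost identically; the gap widens as the correlation at lag 1 falls.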


Example 11.1-6
(numerical example of error reduction by optimal linear interpolation) In Example 11.1-5 we
showed that optimum linear interpolation yields a smaller mean-square error (MSE) than
the sample mean interpolator. How much smaller? Assume that the psd of the sequence
X[n] is $S_{XX}(\omega) = W(2\omega/\pi)$, where $\omega$ is a normalized frequency such that $|\omega| \le \pi$ and
W(x) is the rectangular window function

$$W(x) \triangleq \begin{cases} 1, & |x| \le 1 \\ 0, & \text{else.} \end{cases}$$

Then

$$R_{XX}[m] = \frac{\sin(m\pi/2)}{m\pi}$$

and $R_{XX}[0] = 0.5$, $R_{XX}[1] = 1/\pi$, and $R_{XX}[2] = 0$.
Thus, $a_o = b_o = R_{XX}[1] / (R_{XX}[0] + R_{XX}[2]) = 2/\pi$. Then $\varepsilon_o^2 \approx 0.0947$ while $\varepsilon_s^2 \approx 0.113$,
and estimating X[n] by $\mu_X$ (which is zero) would yield an MSE of 0.5. The fractional percent
improvement in the MSE (optimum) compared to the MSE (sample-mean interpolator) is

$$100 \times \frac{\varepsilon_s^2 - \varepsilon_o^2}{\varepsilon_o^2} \approx 19\%.$$


Example 11.1-7
(one-step predictor) If we set

$$Y[n] \triangleq X[n-1]$$

for all n, we can use the result of Theorem 11.1-3 to evaluate the LMMSE estimate

$$\hat{E}[X[n] \mid X[n-1], \ldots, X[0]],$$

which is called the LMMSE one-step predictor for X[n]:

$$\hat{X}[n] = \sum_{i=0}^{n-1} a_i^{(n)} X[i].$$

Specializing our results, we replace n by n - 1 in Equation 11.1-25 and obtain

$$\mathbf{a}^T = \mathbf{k}_{XX} \mathbf{K}_{XX}^{-1},$$

where in this case $\mathbf{k}_{XX} = (K_{XX}[n,0], K_{XX}[n,1], \ldots, K_{XX}[n,n-1])$ and $\mathbf{K}_{XX}$ is given by

$$\mathbf{K}_{XX} = \begin{bmatrix}
K_{XX}[0,0] & \cdots & K_{XX}[0,n-1] \\
\vdots & & \vdots \\
K_{XX}[n-1,0] & \cdots & K_{XX}[n-1,n-1]
\end{bmatrix}.$$

Example 11.1-8
(computing estimators) Let $\mu_X = 0$ and $K_{XX}[m] = \sigma_1^2 \rho_1^{|m|} + \sigma_2^2 \rho_2^{|m|}$. Given the observations
Y[n] = X[n] + W[n], where the mean and variance of W[n] are 0 and $\sigma_W^2$, respectively, find
the following linear estimators.

(a) First we find the single observation conditional mean $\hat{E}[X[n] \mid Y[n]]$. Since the random
sequences in question are WSS, we can set n = 0 in Theorem 11.1-3 and calculate
$\hat{E}[X[0] \mid Y[0]]$ via Equation 11.1-26 with n = 0 to obtain

$$a = K_{XY}[0] K_{YY}^{-1}[0] = \frac{E[X[0](X[0] + W[0])]}{\sigma_1^2 + \sigma_2^2 + \sigma_W^2}.$$

(b) Next, we find the two-point observation conditional mean $\hat{E}[X[n] \mid Y[n], Y[n-1]]$.
Here we apply Theorem 11.1-3 with n = 1 to calculate $\hat{E}[X[1] \mid Y[1], Y[0]]$ for the
WSS random sequences X and Y. Using Equation 11.1-26 we obtain

$$\begin{bmatrix} a_1 \\ a_0 \end{bmatrix}^T
= \begin{bmatrix} \sigma_1^2 + \sigma_2^2 \\ \sigma_1^2\rho_1 + \sigma_2^2\rho_2 \end{bmatrix}^T
\begin{bmatrix}
\sigma_1^2 + \sigma_2^2 + \sigma_W^2 & \sigma_1^2\rho_1 + \sigma_2^2\rho_2 \\
\sigma_1^2\rho_1 + \sigma_2^2\rho_2 & \sigma_1^2 + \sigma_2^2 + \sigma_W^2
\end{bmatrix}^{-1},$$

where $a_1$ multiplies Y[1] and $a_0$ multiplies Y[0].

In the WSS case, the matrix $\mathbf{K}_{XX}$ is Toeplitz; that is,

$$(\mathbf{K}_{XX})_{ij} = g[i - j]$$

for some g since the covariance depends only on the difference of the two time parameters.
Efficient algorithms exist for computing the inverse of Toeplitz matrices, which allow the
recursive calculation of the coefficient vectors $\mathbf{a}^{(n)}$ for increasing n. Such an algorithm
is the Levinson algorithm, which is described in [11-3, pp. 835-838]. Linear prediction,
as the foregoing is called, is widely used in speech analysis, synthesis, and coding [11-3,
pp. 828-834].

One difficulty in the preceding approach is that the resulting predictors and estimators, 
though linear, nevertheless exhibit growing memory requirement except in the simplest 
cases. We will overcome this problem in Section 11.2 by incorporating a Markov signal 
model. We first pause briefly to discuss the properties of the LMMSE operator we have just 
derived. 


Some Properties of the Operator $\hat{E}$


We have introduced the symbol $\hat{E}$ in Equation 11.1-23 for the LMMSE linear estimator.
Here we regard $\hat{E}$ as an operator and establish certain linearity properties of this operator
that will be useful in the next section. We will use this operator later on to simplify the
derivation of important results in linear estimation.


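Theorem 11.1-4 The operator $\hat{E}$ has the following linearity properties:

(a) $\hat{E}[X_1 + X_2 \mid \mathbf{Y}] = \hat{E}[X_1 \mid \mathbf{Y}] + \hat{E}[X_2 \mid \mathbf{Y}]$

and

(b) when $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are (statistically) orthogonal then

$$\hat{E}[X \mid \mathbf{Y}_1, \mathbf{Y}_2] = \hat{E}[X \mid \mathbf{Y}_1] + \hat{E}[X \mid \mathbf{Y}_2].$$

Proof To prove (a) we note that, with $X \triangleq X_1 + X_2$,

$$\hat{E}[X \mid \mathbf{Y}] = \mathbf{c}^T \mathbf{Y},$$

where the vector $\mathbf{c}$ is given as

$$\mathbf{c}^T = \mathbf{k}_{XY} \mathbf{K}_{YY}^{-1}.$$

Clearly $\mathbf{c} = \mathbf{c}_1 + \mathbf{c}_2$, where $\mathbf{c}_i^T = \mathbf{k}_{X_iY} \mathbf{K}_{YY}^{-1}$, since $\mathbf{k}_{XY} = \mathbf{k}_{X_1Y} + \mathbf{k}_{X_2Y}$; thus,

$$\mathbf{c}^T \mathbf{Y} = \mathbf{c}_1^T \mathbf{Y} + \mathbf{c}_2^T \mathbf{Y}.$$

To show (b) we note that since $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are statistically orthogonal, that is, $E[\mathbf{Y}_1 \mathbf{Y}_2^\dagger] = \mathbf{0}$,
which we write as $\mathbf{Y}_1 \perp \mathbf{Y}_2$, the statistical orthogonalities of the individual estimates

$$(X - \mathbf{c}_1^T \mathbf{Y}_1) \perp \mathbf{Y}_1 \quad \text{and} \quad (X - \mathbf{c}_2^T \mathbf{Y}_2) \perp \mathbf{Y}_2$$

imply that

$$(X - \mathbf{c}_1^T \mathbf{Y}_1 - \mathbf{c}_2^T \mathbf{Y}_2) \perp \text{both } \mathbf{Y}_1 \text{ and } \mathbf{Y}_2,$$

which can be seen from Figure 11.1-1. Then (b) follows. Here $\mathbf{c}_1^T = \mathbf{k}_{XY_1} \mathbf{K}_{Y_1Y_1}^{-1}$ and
$\mathbf{c}_2^T = \mathbf{k}_{XY_2} \mathbf{K}_{Y_2Y_2}^{-1}$. $\blacksquare$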


Figure 11.1-1 Illustration of orthogonal projection. The random variable X is shown as a "vector" in
the Hilbert space of random variables (introduced in Section 10.1). [Diagram labels: the estimate
$\hat{E}[X \mid Y_1, Y_2]$ lies in the plane of $(Y_1, Y_2)$; the error $(X - c_1^T Y_1 - c_2^T Y_2)$ is perpendicular to that plane.]

With reference to Figure 11.1-1, we see that the operator $\hat{E}$ projects the signal X onto
the linear subspace spanned by the observation vectors $Y_i$, i = 1, 2. Thus, $\hat{E}$ is sometimes
referred to as an orthogonal projection operator. Geometrically, it is clear that such an
orthogonal projection will minimize the error in an estimate that is constrained by linearity 
to lie in the linear subspace spanned by the observation vectors. Property (a) then says 
that the orthogonal projection of the sum of two vectors is the sum of their orthogonal 
projections, a result that is geometrically intuitive. Property (b) says that the orthogonal 
projection onto a linear subspace can be computed by summing the orthogonal projections 
onto each of its orthogonal basis vectors. This property will be quite useful in the next 
section on linear prediction. 
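Property (b) of Theorem 11.1-4 can be checked numerically from second moments alone. In the sketch below, X, Y1, and Y2 are taken to be real scalars, and the moment values and the helper name `lmmse_weights` are hypothetical choices for illustration.

```python
import numpy as np

def lmmse_weights(k_xy, K_yy):
    """Return c with c^T = k_XY K_YY^{-1} (real-valued case)."""
    return np.linalg.solve(np.asarray(K_yy).T, np.asarray(k_xy))

# Hypothetical second moments for scalar X, Y1, Y2
k_xy = np.array([1.0, -0.5])             # E[X Y1], E[X Y2]
var_y = np.array([2.0, 3.0])             # E[Y1^2], E[Y2^2]

# Separate projections onto Y1 and onto Y2
c_separate = k_xy / var_y

# Joint projection when Y1 and Y2 are orthogonal: E[Y1 Y2] = 0
c_joint_orth = lmmse_weights(k_xy, np.diag(var_y))

# Joint projection when they are not orthogonal: E[Y1 Y2] = 1
c_joint_corr = lmmse_weights(k_xy, np.array([[2.0, 1.0], [1.0, 3.0]]))

print(c_separate, c_joint_orth, c_joint_corr)
```

With orthogonal observations the joint weights coincide with the separately computed ones; once the observations are correlated they no longer do, which is why the innovations approach of the next section first orthogonalizes the data.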

All this reinforces the Hilbert or linear-space concept of random variables introduced 
in Section 10.1, where we defined the RV norm

$$\|X\|^2 \triangleq E[|X|^2]$$

and inner product

$$(X, Y) \triangleq E[XY^*].$$


This linear vector space of random variables must be distinguished from the random vectors 
X and Y. To emphasize this difference we sometimes say “statistically orthogonal.” 


11.2 INNOVATION SEQUENCES AND KALMAN FILTERING 


In this section we look at the use of signal models to avoid the growing memory aspect of 
the prediction solution found in the last section. We do this by introducing a certain signal 
model, the vector difference equation driven by a white random input sequence W[n],

$$X[n] = AX[n-1] + BW[n], \qquad n \ge 0, \tag{11.2-1}$$

with $X[-1] = X_{-1}$ given and

$$R_{WW}[m] = E[W[m+n]W^\dagger[n]] = \sigma_W^2 \delta[m],$$

and W[n] orthogonal to the past of X[n], including the initial condition $X_{-1}$. In symbols
this becomes

$$W[n] \perp X[m] \ \text{for } m < n \quad \text{and} \quad W[n] \perp X_{-1} \ \text{for } n \ge 0.$$

If W[n] is Gaussian, then by Theorem 8.6-1 X[n] of Equation 11.2-1 is vector Markov. Using
a technique of Example 8.6-2, any scalar LCCDE* driven by white noise can be put into the
form of Equation 11.2-1. The resulting matrix A will always be nonsingular if the dimension
of the state vector is equal to the order of the scalar LCCDE. Thus, we will assume A is a
nonsingular matrix. We will also assume that matrix B is nonsingular.


*LCCDE here stands for Linear Constant Coefficient Difference Equation.


Starting from the initial condition $X_{-1}$ we can recursively compute the following
forwards-in-time or causal solution,

$$\begin{aligned}
X[0] &= AX_{-1} + BW[0], \\
X[1] &= AX[0] + BW[1] \\
     &= A^2 X_{-1} + ABW[0] + BW[1], \\
X[2] &= A^3 X_{-1} + A^2 BW[0] + ABW[1] + BW[2],
\end{aligned}$$

and so forth.
We thus infer the general solution

$$X[n] = \sum_{k=0}^{n} A^{n-k} BW[k] + A^{n+1} X_{-1}, \tag{11.2-2}$$

the first term of which is just a convolution of the vector input sequence with the matrix
impulse response,

$$H[n] \triangleq A^n B u[n].$$

It is important to note that Equation 11.2-2 is a causal and linear transformation from
the space of input random sequences W[n] (including $X_{-1}$) to the space of output random
sequences X[n]. Indeed, if we use Equation 11.2-1, since $B^{-1}$ exists, we can also write the
input sequence as a causal linear transformation on the output random sequence,

$$W[n] = B^{-1}[X[n] - AX[n-1]], \qquad n \ge 0.$$


We say the white sequence W[n] is causally equivalent to X[n] and call it the innovations
sequence associated with X. The name comes from the fact that W contains the new
information obtained when we observe X[n] given the past [X[n - 1], X[n - 2], ..., X[0]].
This statement will subsequently become more clear. In general, we make the following 
definition. 
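The causal equivalence between W[n] and X[n] can be illustrated with a short simulation: run the recursion of Equation 11.2-1, compare it with the closed form of Equation 11.2-2, and then recover the input from the output. The matrices, initial condition, and random seed below are hypothetical values chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [-0.1, 0.7]])   # hypothetical nonsingular state matrix
B = np.array([[1.0, 0.0], [0.5, 1.0]])    # hypothetical nonsingular input matrix
x_init = np.array([1.0, -1.0])            # initial condition X_{-1}
N = 6
W = rng.standard_normal((N, 2))           # one realization of the white input

# Forward recursion X[n] = A X[n-1] + B W[n]
X = np.empty((N, 2))
prev = x_init
for n in range(N):
    X[n] = A @ prev + B @ W[n]
    prev = X[n]

def closed_form(n):
    """X[n] = sum_{k=0}^{n} A^(n-k) B W[k] + A^(n+1) X_{-1}."""
    total = np.linalg.matrix_power(A, n + 1) @ x_init
    for k in range(n + 1):
        total = total + np.linalg.matrix_power(A, n - k) @ B @ W[k]
    return total

def recover_input(n):
    """W[n] = B^{-1} (X[n] - A X[n-1]), the causal inverse transformation."""
    previous = x_init if n == 0 else X[n - 1]
    return np.linalg.solve(B, X[n] - A @ previous)

print(np.allclose(closed_form(N - 1), X[N - 1]))
print(np.allclose(recover_input(N - 1), W[N - 1]))
```

Both transformations use only present and past values, which is what "causal and causally invertible" means for this model.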


Definition 11.2-1 The innovations sequence of a random sequence X[n] is defined to
be a white random sequence, which is a causal and causally invertible† linear transformation
of the sequence X[n]. $\blacksquare$


It follows immediately that we can write

$$\hat{E}[X[n] \mid X[n-1], \ldots, X[0], X_{-1}] = \hat{E}[X[n] \mid W[n-1], \ldots, W[0], X_{-1}],$$

because the LMMSE estimate can always undo a causally invertible linear transformation.
That is, the required inverse can, if needed, be part of the general causal linear
transformation $\hat{E}$. To see the benefit of the innovations concept, consider evaluating

$$\hat{E}[X[n] \mid W[n-1], \ldots, W[0], X_{-1}].$$


†The term "causally invertible" means that the linear transformation has an inverse which is causal.

Rewriting Equation 11.2-2 by isolating the k = n term in the sum, we see that

$$X[n] = \sum_{k=0}^{n-1} A^{n-k} BW[k] + BW[n] + A^{n+1} X_{-1}.$$

Applying the $\hat{E}$ operator, and using linearity property (a) of Theorem 11.1-4, we obtain

$$\begin{aligned}
\hat{E}[X[n] \mid W[n-1], \ldots, W[0], X_{-1}]
&= \sum_{k=0}^{n-1} A^{n-k} B \hat{E}[W[k] \mid W[n-1], \ldots, W[0], X_{-1}] \\
&\quad + B \hat{E}[W[n] \mid W[n-1], \ldots, W[0], X_{-1}] \\
&\quad + A^{n+1} \hat{E}[X_{-1} \mid W[n-1], \ldots, W[0], X_{-1}].
\end{aligned}$$

Then repeatedly using linearity property (b) of the same theorem, since the innovations
$[W[n-1], \ldots, W[0], X_{-1}]$ are orthogonal, we have

$$\hat{E}[X[n] \mid W[n-1], \ldots, W[0], X_{-1}] = \sum_{k=0}^{n-1} A^{n-k} BW[k] + A^{n+1} X_{-1}. \tag{11.2-3}$$

From Equation 11.2-2 we also have

$$X[n-1] = \sum_{k=0}^{n-1} A^{n-1-k} BW[k] + A^{n} X_{-1},$$

so combining, we get

$$\hat{X}[n] = AX[n-1]. \tag{11.2-4}$$


This is the final form of the LMMSE predictor for the state equation model with a white 
random input sequence. (Note: We are assuming that the mean of the input sequence is 
zero. This is incorporated in the white noise definition.) The overall operation is shown in 
Figure 11.2-1. 

Equation 11.2-4 can also be derived by applying the $\hat{E}$ operator directly to
Equation 11.2-1 and using the fact that $X[n-1] \perp W[n]$. For example, $\hat{E}[X[n] \mid X[n-1]] =
\hat{E}[AX[n-1] + BW[n] \mid X[n-1]] = \hat{E}[AX[n-1] \mid X[n-1]] + \hat{E}[BW[n] \mid X[n-1]]$. But as
$W[n] \perp X[n-1]$, it follows that $\hat{E}[X[n] \mid X[n-1]] = \hat{E}[AX[n-1] \mid X[n-1]] = AX[n-1]$. We
can thus view Equation 11.2-1 as a one-step innovations representation in the same sense
that Equation 11.2-2 is an (n + 1)-step innovations representation. The innovations method
is of quite general use in linear estimation theory, the basic underlying concept being a
representation of the observed data as an orthogonal decomposition. We will make good
use of the innovations method in deriving the Kalman filter below [11-3].

Figure 11.2-1 Innovations decomposition of LMMSE predictor. [Block diagram: X[n] → innovations
transformation → W[n] → LMMSE predictor → $\hat{X}[n]$.]

With reference to our state equation model (Equation 11.2-1) we note that if we added 
the condition that the driving noise W[n] be Gaussian, then X[n] would be both Gaussian 
and Markov. In this case the preceding LMMSE estimate would also be the MMSE estimate; 
that is, $\hat{E}$ would be E. This motivates the following weakening of the Markov property.


Definition 11.2-2 A random sequence X[n] is called wide-sense Markov if for all n

$$\hat{E}[X[n] \mid X[n-1], X[n-2], \ldots] = \hat{E}[X[n] \mid X[n-1]].$$


We note that this definition has built in the concept of limited memory in that the LMMSE 
prediction cannot be changed by the incorporation of earlier data. From Equation 11.2-4, 
it follows immediately that the solution to Equation 11.2-1 is a wide-sense Markov random 
sequence. One can prove that any wide-sense Markov random sequence would satisfy a 
first-order vector state equation with a white input that is uncorrelated with the past 
of X[n]. 
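The limited-memory property can be seen directly from second moments. The sketch below takes a hypothetical stable model of the form of Equation 11.2-1 with unit-variance white input, computes the stationary covariance by iterating $P = APA^T + BB^T$, and then solves for the LMMSE weights that predict X[n] from both X[n-1] and X[n-2]. All numerical values are assumptions, and real-valued data are assumed.

```python
import numpy as np

A = np.array([[0.9, 0.2], [-0.1, 0.7]])   # hypothetical stable state matrix
B = np.array([[1.0, 0.0], [0.5, 1.0]])    # hypothetical nonsingular input matrix
Q = B @ B.T                                # covariance of B W[n] for unit-variance W

# Stationary covariance P = E[X[n] X[n]^T] solves P = A P A^T + Q
P = np.zeros((2, 2))
for _ in range(500):
    P = A @ P @ A.T + Q

# Second moments of the stacked data vector [X[n-1]; X[n-2]]
K_data = np.block([[P, A @ P], [P @ A.T, P]])
# Cross-moments E[X[n] X[n-1]^T] = A P and E[X[n] X[n-2]^T] = A^2 P
k_cross = np.hstack([A @ P, A @ A @ P])

# LMMSE weights C with C K_data = k_cross
C = np.linalg.solve(K_data.T, k_cross.T).T

print(np.round(C, 6))
```

The first block of C reproduces A and the second block vanishes, so adding X[n-2] to the data does not change the prediction, in line with Definition 11.2-2.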


Theorem 11.2-1 Let X[n] be a wide-sense Markov zero-mean sequence. Then there 
exists an innovations sequence W[n] such that Equation 11.2-1 is satisfied and the sequence 
W is orthogonal to the past of X. 


Proof Since X is wide-sense Markov and zero-mean,

$$\hat{E}[X[n] \mid X[n-1], X[n-2], \ldots] = \hat{E}[X[n] \mid X[n-1]] \triangleq AX[n-1],$$

where A is defined as the matrix of the indicated LMMSE one-step prediction. Next we
define W as

$$W[n] \triangleq X[n] - AX[n-1].$$

It then follows that E[W[n]] = 0 and

$$W[n] \perp (X[n-1], X[n-2], \ldots)$$

because of the fact that X is wide-sense Markov and the prediction error must be orthogonal
to the past data used in the prediction. Furthermore,

$$E[W[n]W^\dagger[m]] = 0 \quad \text{for } m < n,$$

since W[m] is a linear function of the past and present of X[m], which is in the past of X[n]
for m < n. Thus,

$$E[W[n]W^\dagger[m]] = \sigma_W^2[n] \delta[m - n],$$

where $\sigma_W^2[n] \triangleq E[W[n]W^\dagger[n]]$. $\blacksquare$


Example 11.2-1
(scalar random sequence of Markov order 2) Consider the second-order scalar difference
equation driven by white noise,

$$X[n] = \alpha X[n-1] + \beta X[n-2] + W[n],$$

where X[-1] = X[-2] = 0 and where W[n] has zero-mean and variance $\mathrm{Var}[W[n]] = \sigma_W^2$.
To apply the vector wide-sense Markov prediction results, we construct the random vector

$$\mathbf{X}[n] \triangleq [X[n], X[n-1]]^T$$

and obtain the first-order vector equation

$$\mathbf{X}[n] = \mathbf{A}\mathbf{X}[n-1] + \mathbf{b}W[n]$$

on setting

$$\mathbf{A} \triangleq \begin{pmatrix} \alpha & \beta \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad \mathbf{b} \triangleq \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

Using Equation 11.2-4 we then have

$$\hat{E}[\mathbf{X}[n] \mid \mathbf{X}[n-1], \ldots, \mathbf{X}[0]] = \mathbf{A}\mathbf{X}[n-1] = \begin{pmatrix} \alpha & \beta \\ 1 & 0 \end{pmatrix} \begin{pmatrix} X[n-1] \\ X[n-2] \end{pmatrix},$$

so that

$$\hat{X}[n] = \alpha X[n-1] + \beta X[n-2].$$

Sometimes such a scalar random sequence is called wide-sense Markov of order 2. More
generally a pth order, scalar LCCDE driven by white noise generates a scalar wide-sense
Markov random sequence of order p. (cf. Definition 8.5-2.)
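The state-vector construction of Example 11.2-1 can be checked by running the scalar recursion and the first-order vector recursion side by side. The coefficient values, seed, and variable names below are hypothetical; the check simply confirms that both forms generate the same sequence and that the one-step prediction error equals the driving noise.

```python
import numpy as np

alpha, beta = 0.5, -0.3                   # hypothetical LCCDE coefficients
A = np.array([[alpha, beta], [1.0, 0.0]])
b = np.array([1.0, 0.0])

rng = np.random.default_rng(1)
N = 8
w = rng.standard_normal(N)                # one realization of the white input

# Scalar second-order recursion with X[-1] = X[-2] = 0
x = np.zeros(N + 2)                       # x[0], x[1] hold X[-2], X[-1]
for n in range(N):
    x[n + 2] = alpha * x[n + 1] + beta * x[n] + w[n]
scalar = x[2:]

# Equivalent first-order vector recursion on the state [X[n], X[n-1]]
state = np.zeros(2)                       # [X[-1], X[-2]]
vec = np.empty(N)
pred = np.empty(N)
for n in range(N):
    pred[n] = (A @ state)[0]              # one-step prediction of X[n]
    state = A @ state + b * w[n]
    vec[n] = state[0]

print(np.allclose(scalar, vec), np.allclose(vec - pred, w))
```

Note that b here is a vector rather than a nonsingular square matrix, so only the prediction result is being illustrated, not the invertibility argument used earlier for B.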


Predicting Gaussian Random Sequences 


Here we look at the special case where the random sequence is Gaussian so that the results 
of the last section having to do with wide-sense Markov become strict-sense Markov and 
the orthogonal random sequence W[n] becomes an independent random sequence. We start 
with a theorem, which is in part a restatement and in part a strengthening of some of the 
previous results. 


Theorem 11.2-2 Let X[n] be a zero-mean, Gauss-Markov random sequence. Then
X[n] satisfies the difference equation

$$X[n] = A_n X[n-1] + B_n W[n]$$

for some $A_n$, $B_n$, and white Gaussian, zero-mean sequence W[n].

Proof Since X[n] is Gaussian, the conditional mean $E[X[n] \mid X[n-1], X[n-2], \ldots]$
is a linear function of $[X[n-1], X[n-2], \ldots]$ because in the Gaussian case this MMSE
estimator is linear. Since the random sequence is also Markov we have that

$$E[X[n] \mid X[n-1], \ldots] = E[X[n] \mid X[n-1]] \triangleq A_n X[n-1],$$

for some $A_n$ that may depend on n. In fact, we know the matrix $A_n$ can be determined by
the orthogonality relation

$$(X[n] - A_n X[n-1]) \perp X[n-1].$$

What remains to be shown is that the prediction-error sequence $X[n] - A_n X[n-1]$ is a white
Gaussian random sequence. First, we know it is Gaussian because it is a linear operation
on a Gaussian random sequence. Second, we know that the prediction error is (statistically)
orthogonal to all previous X[k] for all k < n. Thus,

$$(X[n] - A_n X[n-1]) \perp (X[k] - A_k X[k-1])$$

for all k < n. Hence it is an orthogonal random sequence, but this is the same as saying it
is white and zero-mean. Thus the proof is completed by setting

$$B_n \sigma_W^2[n] B_n^\dagger \triangleq E[(X[n] - A_n X[n-1])(X[n] - A_n X[n-1])^\dagger].$$

In fact, we can just as well take $B_n = I$† and then

$$\begin{aligned}
\sigma_W^2[n] &= E[(X[n] - A_n X[n-1])(X[n] - A_n X[n-1])^\dagger] \\
&= E[X[n]X^\dagger[n]] - A_n E[X[n-1]X^\dagger[n]] \\
&= K_{XX}[n,n] - A_n K_{XX}[n-1,n],
\end{aligned}$$

but

$$A_n = K_{XX}[n,n-1] K_{XX}^{-1}[n-1,n-1] \tag{11.2-5}$$

so

$$\sigma_W^2[n] = K_{XX}[n,n] - K_{XX}[n,n-1] K_{XX}^{-1}[n-1,n-1] K_{XX}[n-1,n].$$


Note that the difference between Theorem 11.2-1 and Theorem 11.2-2 is that in the latter
theorem we are assuming that the vector random sequence X[n] is Gaussian, while in the
former this condition is not assumed. The sequence W[n] is then an independent random
sequence and is Gaussian when X is Gaussian as well as Markov.

We also can note that if X[n] were also stationary in Theorem 11.2-2, then the coefficient
matrices $A_n$ and $B_n$ would be constants, and the innovations variance matrix would be
constant $\sigma_W^2$. Finally note that since we use the expectation operator E in Theorem 11.2-2
but only the LMMSE operator $\hat{E}$ in Theorem 11.2-1, the representation in the Gaussian
case is really much stronger than in the LMMSE case.
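To make the stationary remark concrete, consider a scalar, real-valued, stationary special case of Equation 11.2-5. The covariance $K_{XX}[m] = \sigma_X^2 \rho^{|m|}$ with $|\rho| < 1$ is an assumption introduced here for illustration; under it the one-step coefficient and the innovations variance reduce to constants.

```latex
\begin{align*}
A &= K_{XX}[n,n-1]\,K_{XX}^{-1}[n-1,n-1]
   = \frac{\sigma_X^2\,\rho}{\sigma_X^2} = \rho, \\
\sigma_W^2 &= K_{XX}[n,n] - K_{XX}[n,n-1]\,K_{XX}^{-1}[n-1,n-1]\,K_{XX}[n-1,n]
   = \sigma_X^2 - \frac{(\sigma_X^2\,\rho)^2}{\sigma_X^2}
   = \sigma_X^2\,(1 - \rho^2).
\end{align*}
```

Neither quantity depends on n, and the innovations variance shrinks toward zero as $|\rho|$ approaches 1, that is, as the sequence becomes more predictable from its immediate past.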


†Thus ensuring that $B^{-1}$ exists where it is useful, such as for ensuring the innovations property of


Kalman Predictor and Filter


Here we extend the prediction results developed in the last section by enlarging the class 
of applications to include prediction from noisy data. This generalization when combined 
with the Gauss-Markov signal model will result in the celebrated Kalman-Bucy prediction 
filter. In 1960 R. E. Kalman published the discrete-time theory [11-4]. A year later the 
continuous-time theory was published by R. E. Kalman and R. S. Bucy [11-5]. Actually, 
we do not really need the Gauss-Markov assumption. We could just assume the signal is 
wide-sense Markov and derive the LMMSE filter. The result would be the same as what we 
will derive here essentially because the MMSE filter is linear for Gaussian data. 

We will assume that the Gauss-Markov sequence to be predicted is stationary with
zero-mean and is defined for $n \ge 0$ with known initial condition X[-1] = 0. We will also
restrict attention in this section to real-valued random vector sequences. By Theorem 11.2-2
we have that the Gauss-Markov signal can be represented as

$$X[n] = AX[n-1] + BW[n], \qquad n \ge 0, \tag{11.2-6}$$

subject to X[-1] = 0, where W[n] is white Gaussian noise with zero-mean and variance
matrix $\sigma_W^2$. As earlier, the matrix A is taken to be nonsingular.

The observations are no longer assumed noiseless. Instead, we will assume the more
practical case where noise has been added to the signal prior to observation,

$$Y[n] = X[n] + V[n],‡ \qquad n \ge 0, \tag{11.2-7}$$

where the random sequence V[n], called the observation noise, is white, Gaussian, and
zero-mean. We take the observation noise V[n] to be stationary with variance matrix
$\sigma_V^2 \triangleq E[V[n]V^T[n]]$, and we remember that all random sequences are assumed real-valued.
Additionally, we assume that V and W are orthogonal at all pairs of observation times,
that is,

$$V[n] \perp W[k] \quad \text{for all } n, k.$$

Since the two noises are zero-mean, this amounts to saying that V and W are uncorrelated
at all times n and k. We can write this more compactly as $V \perp W$. Furthermore we
take V and W as jointly Gaussian so that they are in fact jointly independent random
sequences.

Our method of solution will be to first find the innovation sequence for the noisy obser- 
vations (Equation 11.2-7) and then to base our estimate on it. The Kalman predictor and 
filter are then derived. We will see that they have a convenient predictor—corrector structure. 
Finally, we will solve for certain error-covariance functions necessary to determine so-called 
gain matrices in the filter. 

Now we know that the MMSE prediction of the signal sequence X[n] based on the
observation set {Y[k], k < n} is the corresponding conditional mean. Thus, we look for
$\hat{X}[n] \triangleq E[X[n] \mid Y[n-1], Y[n-2], \ldots, Y[0]]$. We first define an innovations sequence for
Y[n], as we did for the noiseless observations X[n]. Motivated by the requirements of the innovations
of Definition 11.2-1, we define the sequence $\tilde{Y}$ as follows:

$$\begin{aligned}
\tilde{Y}[0] &\triangleq Y[0] \quad \text{and for } n \ge 1 \\
\tilde{Y}[n] &\triangleq Y[n] - E[Y[n] \mid Y[n-1], Y[n-2], \ldots, Y[0]].
\end{aligned} \tag{11.2-8}$$


‡A more general observation model $Y[n] = C_n X[n] + V[n]$ is treated in Problem 11.10. This
generalization allows modeling deconvolution-type problems.


We now must show that $\tilde{Y}[n]$ thus defined is an innovations sequence for Y[n]. To do this
we must prove that $\tilde{Y}[n]$ satisfies the three defining properties (see Definition 11.2-1):

(1) $\tilde{Y}[n]$ is a causal, linear transformation on Y[n],
(2) Y[n] is a causal, linear transformation on $\tilde{Y}[n]$, and
(3) $\tilde{Y}[n]$ is an orthogonal (or white) random sequence.

Now (1) is immediate by the definition of $\tilde{Y}[n]$, since Y[n] is a Gaussian random sequence
and its conditional mean given the past observations is therefore linear in them. To
show property (2) we note that we can recursively solve Equation 11.2-8 for Y[n] as

$$\begin{aligned}
Y[0] &= \tilde{Y}[0] \quad \text{and} \\
Y[n] &= \tilde{Y}[n] + E[Y[n] \mid Y[n-1], Y[n-2], \ldots, Y[0]] \\
     &= \tilde{Y}[n] + \sum_{k=0}^{n-1} D_k^{(n)} \tilde{Y}[k],
\end{aligned}$$

where the $D_k^{(n)}$ are a known sequence of matrices,† thus establishing (2). As an additional
piece of terminology, when (1) and (2) hold simultaneously we say that Y and $\tilde{Y}$ are causally
linearly equivalent.

To establish (3) we note that, for k < n,

$$E[\tilde{Y}[n]\tilde{Y}^T[k]] = 0,$$

since $\tilde{Y}[n] \perp [Y[n-1], Y[n-2], \ldots, Y[0]]$ by the orthogonality principle and hence is
orthogonal to any linear combination of the Y[k] for k < n, of which $\tilde{Y}[k]$ is one. Similarly, we have
$E[\tilde{Y}[n]\tilde{Y}^T[k]] = 0$ for n < k. Thus, combining we have

$$E[\tilde{Y}[n]\tilde{Y}^T[k]] = \sigma_{\tilde{Y}}^2[n] \delta[n - k]$$

for some variance matrix $\sigma_{\tilde{Y}}^2[n]$.‡ Combining properties (1), (2), and (3) we see that $\tilde{Y}[n]$
is a desired innovations sequence for the noisy observations Y[n].
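For a finite block of scalar observations, the three properties above can be illustrated with a triangular factorization of the observation covariance. The sketch below is one way to do this under stated assumptions: real scalar data, a hypothetical signal covariance $\rho^{|i-j|}$ plus white noise of variance `sv`, and a Cholesky-based unit-triangular factor standing in for the matrices $D_k^{(n)}$. It is not the construction used in the text, only a numerically convenient equivalent.

```python
import numpy as np

# Hypothetical observation covariance: exponential signal covariance plus white noise
N, rho, sv = 5, 0.8, 0.5
idx = np.arange(N)
K_yy = rho ** np.abs(idx[:, None] - idx[None, :]) + sv * np.eye(N)

# Factor K_YY = L diag(innov_var) L^T with L unit lower triangular
G = np.linalg.cholesky(K_yy)              # K_YY = G G^T, G lower triangular
d = np.diag(G)
L = G / d                                 # scale column j by 1 / d[j]
innov_var = d ** 2                        # variances of the innovations

# Covariance of the innovations vector L^{-1} Y, i.e. L^{-1} K_YY L^{-T}
K_innov = np.linalg.solve(L, np.linalg.solve(L, K_yy).T)

print(np.round(K_innov, 10))
print(innov_var)
```

Because L and its inverse are both lower triangular, each innovation depends only on present and past observations and vice versa, and the diagonal `K_innov` confirms whiteness. The first innovation variance equals the variance of Y[0], and the later ones are smaller, consistent with the remark that the innovations variance is not constant near n = 0.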


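†For example, to compute $\mathbf{Y}[1]$ in terms of $\tilde{\mathbf{Y}}[0], \tilde{\mathbf{Y}}[1]$, recall that $\tilde{\mathbf{Y}}[0] = \mathbf{Y}[0]$ and $\mathbf{Y}[1] = \tilde{\mathbf{Y}}[1] + E[\mathbf{Y}[1] \mid \mathbf{Y}[0]]$. But $\tilde{\mathbf{Y}}[0] = \mathbf{Y}[0]$, and for the Gaussian case the MMSE estimate is also the LMMSE estimate, so that
$E[\mathbf{Y}[1] \mid \mathbf{Y}[0]] = E[\mathbf{Y}[1] \mid \tilde{\mathbf{Y}}[0]] = \hat{E}[\mathbf{Y}[1] \mid \tilde{\mathbf{Y}}[0]] = \mathbf{D}_0\tilde{\mathbf{Y}}[0]$. Hence $\mathbf{Y}[1] = \tilde{\mathbf{Y}}[1] + \mathbf{D}_0\tilde{\mathbf{Y}}[0]$. As usual, $\mathbf{D}_0$ is
found by the orthogonality principle.

‡The reader should understand that $\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n]$ will not be constant since observations start at $n = 0$, so the innovations may be initially large. In fact $\tilde{\mathbf{Y}}[0] = \mathbf{Y}[0]$.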
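Since $\mathbf{Y}[n]$ and $\tilde{\mathbf{Y}}[n]$ are causally linearly equivalent, we can base our estimate on $\tilde{\mathbf{Y}}[n]$
instead of $\mathbf{Y}[n]$, with expected simplifications due to the orthogonality of the innovations
sequence $\tilde{\mathbf{Y}}[n]$. Since the data are Gaussian, the conditional mean is linear in the data, that is, $E = \hat{E}$, and
we can thus write

$$
\hat{\mathbf{X}}[n] = E[\mathbf{X}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]] = E[\mathbf{X}[n] \mid \tilde{\mathbf{Y}}[n-1], \ldots, \tilde{\mathbf{Y}}[0]],
$$

which by Theorem 11.1-4(b) becomes

$$
\begin{aligned}
\hat{\mathbf{X}}[n] &= \sum_{k=0}^{n-1} E[\mathbf{X}[n] \mid \tilde{\mathbf{Y}}[k]] \\
&= \sum_{k=0}^{n-1} \mathbf{C}_k^{(n)}\tilde{\mathbf{Y}}[k],
\end{aligned}
\tag{11.2-9}
$$

where

$$
E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]] = \mathbf{C}_k^{(n)}\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k], \qquad 0 \le k < n.
$$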


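Assuming that the variance matrix $\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]$ is nonsingular, this equation can be solved for
$\mathbf{C}_k^{(n)}$ to yield

$$
\mathbf{C}_k^{(n)} = E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]\right)^{-1}. \tag{11.2-10}
$$

So, we can also write the prediction estimate as

$$
\hat{\mathbf{X}}[n] = \sum_{k=0}^{n-1} E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]\right)^{-1}\tilde{\mathbf{Y}}[k]. \tag{11.2-11}
$$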


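Since the signal model Equation 11.2-6 is recursive, we suspect there is a way around the
growing memory estimate that appears in Equation 11.2-11. Substituting Equation 11.2-6
for $\mathbf{X}[n]$ we can write

$$
E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]] = \mathbf{A}E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[k]] + \mathbf{B}E[\mathbf{W}[n]\tilde{\mathbf{Y}}^T[k]],
$$

but for $k < n$, $\mathbf{W}[n] \perp \mathbf{X}[k]$ and $\mathbf{W}[n] \perp \mathbf{V}[k]$, which implies $\mathbf{W}[n] \perp \mathbf{Y}[k]$ and hence also
$\mathbf{W}[n] \perp \tilde{\mathbf{Y}}[k]$, so that we have

$$
E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]] = \mathbf{A}E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[k]] \quad \text{for all } k < n.
$$

Hence we can also express the prediction estimate $\hat{\mathbf{X}}[n]$ as

$$
\hat{\mathbf{X}}[n] = \mathbf{A}\sum_{k=0}^{n-1} E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[k]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]\right)^{-1}\tilde{\mathbf{Y}}[k]. \tag{11.2-12}
$$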


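But Equation 11.2-11 must hold at $n - 1$ as well as $n$, thus also

$$
\hat{\mathbf{X}}[n-1] = \sum_{k=0}^{n-2} E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[k]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]\right)^{-1}\tilde{\mathbf{Y}}[k].
$$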
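Combining this equation with Equation 11.2-12, we finally obtain

$$
\hat{\mathbf{X}}[n] = \mathbf{A}\hat{\mathbf{X}}[n-1] + \mathbf{A}E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[n-1]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n-1]\right)^{-1}\tilde{\mathbf{Y}}[n-1],
$$

which is an efficient way of calculating the prediction estimate of Equation 11.2-12. If we
define the Kalman gain matrix

$$
\mathbf{G}_{n-1} \triangleq E[\mathbf{X}[n-1]\tilde{\mathbf{Y}}^T[n-1]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n-1]\right)^{-1}, \tag{11.2-13}
$$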


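we can rewrite the preceding result as

$$
\hat{\mathbf{X}}[n] = \mathbf{A}\left(\hat{\mathbf{X}}[n-1] + \mathbf{G}_{n-1}\tilde{\mathbf{Y}}[n-1]\right), \qquad n > 0, \tag{11.2-14}
$$

with initial condition $\hat{\mathbf{X}}[0] = \mathbf{0}$.

We can eliminate the innovations sequence $\tilde{\mathbf{Y}}[n]$ in Equation 11.2-14 as follows:

$$
\tilde{\mathbf{Y}}[n] = \mathbf{Y}[n] - E[\mathbf{Y}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]],
$$

so using Equation 11.2-7, we have

$$
\begin{aligned}
E[\mathbf{Y}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]] &= E[\mathbf{X}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]] + E[\mathbf{V}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]] \\
&= \hat{\mathbf{X}}[n] + \mathbf{0}.
\end{aligned}
$$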


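This last step is justified by noting that $\mathbf{V}[n] \perp \mathbf{X}[k]$ and $\mathbf{V}[n] \perp \mathbf{V}[k]$
for all $k < n$, so that
we have $\mathbf{V}[n] \perp \mathbf{Y}[k]$ for all $k < n$. Thus, we obtain $\tilde{\mathbf{Y}}[n] = \mathbf{Y}[n] - \hat{\mathbf{X}}[n]$. So inserting this
into Equation 11.2-14 we finally have

$$
\hat{\mathbf{X}}[n] = \mathbf{A}\left[\hat{\mathbf{X}}[n-1] + \mathbf{G}_{n-1}\left(\mathbf{Y}[n-1] - \hat{\mathbf{X}}[n-1]\right)\right], \tag{11.2-15}
$$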


which is the most well known form of the Kalman predictor, whose system diagram is shown 
in Figure 11.2-2. 


Figure 11.2-2 System diagram of Kalman predictor.
We can denote the prediction estimate in Equation 11.2-15 more explicitly as

$$
\hat{\mathbf{X}}[n|n-1] \triangleq E[\mathbf{X}[n] \mid \mathbf{Y}[n-1], \mathbf{Y}[n-2], \ldots, \mathbf{Y}[0]] = \hat{\mathbf{X}}[n].
$$

On the other hand, we may be interested in calculating the causal estimate

$$
\hat{\mathbf{X}}[n|n] \triangleq E[\mathbf{X}[n] \mid \mathbf{Y}[n], \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]],
$$


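which uses all the data up to the present time $n$. The Kalman predictor can be modified
to provide this causal estimate. The resulting recursive formula is called the Kalman filter.
One method to derive it from Equation 11.2-15 is the following. Consider the prediction

$$
\hat{\mathbf{X}}[n|n-1] = E[\mathbf{X}[n] \mid \mathbf{Y}[n-1], \ldots, \mathbf{Y}[0]],
$$

and use the signal model Equation 11.2-6 to obtain

$$
\hat{\mathbf{X}}[n|n-1] = \mathbf{A}\hat{\mathbf{X}}[n-1|n-1] + \mathbf{0},
$$

since $\mathbf{W}[n] \perp \mathbf{Y}[k]$ for $k < n$. So

$$
\hat{\mathbf{X}}[n-1|n-1] = \mathbf{A}^{-1}\hat{\mathbf{X}}[n|n-1]
$$

since $\mathbf{A}$ is nonsingular. Using this result we pre-multiply Equation 11.2-15 by $\mathbf{A}^{-1}$ to get

$$
\hat{\mathbf{X}}[n-1|n-1] = \hat{\mathbf{X}}[n-1|n-2] + \mathbf{G}_{n-1}\left(\mathbf{Y}[n-1] - \hat{\mathbf{X}}[n-1|n-2]\right),
$$

which can be written equivalently for $n \ge 0$ as

$$
\hat{\mathbf{X}}[n|n] = \mathbf{A}\hat{\mathbf{X}}[n-1|n-1] + \mathbf{G}_n\left(\mathbf{Y}[n] - \mathbf{A}\hat{\mathbf{X}}[n-1|n-1]\right), \tag{11.2-16}
$$

known as the Kalman filter equation. Here take $\hat{\mathbf{X}}[-1|-1] \triangleq \mathbf{0}$.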

By examining either Equation 11.2-15 or Equation 11.2-16 we can recognize a predictor-corrector
structure. The first term is the MMSE prediction of $\mathbf{X}[n]$ based on the past
observations $\mathbf{Y}[k]$, $k < n$. The second term, involving the current data, is called the update.
It is the product of a gain matrix $\mathbf{G}_n$ (which can be precomputed and stored) and the
prediction error based on the noisy observations, which we have called the innovations, that
is, the new information contained in the current data.


Direct Filter Derivation 


An alternative derivation of the Kalman filter (Equation 11.2-16), which avoids the need to 
invert the system matrix A, proceeds as follows. First we write 


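$$
\begin{aligned}
\hat{\mathbf{X}}[n|n] &= E[\mathbf{X}[n] \mid \mathbf{Y}[n], \ldots, \mathbf{Y}[0]], \qquad n \ge 0 \\
&= E[\mathbf{X}[n] \mid \tilde{\mathbf{Y}}[n], \ldots, \tilde{\mathbf{Y}}[0]] \\
&= \sum_{k=0}^{n} \mathbf{C}_k^{(n)}\tilde{\mathbf{Y}}[k]
\end{aligned}
$$

by use of Theorem 11.1-4 as before. Here $\tilde{\mathbf{Y}}[0] = \mathbf{Y}[0]$, and $\mathbf{C}_k^{(n)} = E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[k]]\left(\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[k]\right)^{-1}$,
$0 \le k \le n$. Pulling off the $k = n$ term, we write

$$
\begin{aligned}
\hat{\mathbf{X}}[n|n] &= \mathbf{C}_n^{(n)}\tilde{\mathbf{Y}}[n] + \sum_{k=0}^{n-1} \mathbf{C}_k^{(n)}\tilde{\mathbf{Y}}[k] \\
&= \mathbf{C}_n^{(n)}\tilde{\mathbf{Y}}[n] + \hat{\mathbf{X}}[n|n-1] \qquad \text{by Equation 11.2-9} \\
&= \hat{\mathbf{X}}[n|n-1] + \mathbf{C}_n^{(n)}\tilde{\mathbf{Y}}[n] \\
&= \hat{\mathbf{X}}[n|n-1] + \mathbf{G}_n\left(\mathbf{Y}[n] - \hat{\mathbf{X}}[n|n-1]\right).
\end{aligned}
$$

Figure 11.2-3 System diagram of Kalman filter.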


Figure 11.2-3 shows the system diagram of the Kalman filter. 
One can go on to derive Kalman smoothers, which are fixed-delay estimators of the
form

$$
\hat{\mathbf{X}}[n|n+k] \triangleq E[\mathbf{X}[n] \mid \mathbf{Y}[n+k], \mathbf{Y}[n+k-1], \ldots, \mathbf{Y}[0]], \qquad k > 0. \tag{11.2-17}
$$

These delayed estimators are of importance in the study of various communication and
control systems. As yet we have not discussed an efficient algorithm to calculate the sequence
of gain matrices $\mathbf{G}_n$. This is the subject of the next section.


Error-Covariance Equations 


We have to find a method for recursively calculating the gain matrix sequence $\mathbf{G}_n$ in
Equation 11.2-13. We are also interested in evaluating the mean-square error of the estimate.

The prediction-error variance matrix is the covariance matrix of $\tilde{\mathbf{X}}[n] \triangleq \hat{\mathbf{X}}[n|n-1] - \mathbf{X}[n]$.
We write it as

$$
\boldsymbol{\varepsilon}^2[n] \triangleq E[\tilde{\mathbf{X}}[n]\tilde{\mathbf{X}}^T[n]].
$$
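We start by inserting the observation Equation 11.2-7 into the innovations Equation 11.2-8
to obtain

$$
\begin{aligned}
\tilde{\mathbf{Y}}[n] &= \mathbf{X}[n] + \mathbf{V}[n] - \hat{\mathbf{X}}[n]\,{}^{*} \\
&= -\tilde{\mathbf{X}}[n] + \mathbf{V}[n].
\end{aligned}
\tag{11.2-18}
$$

But $\mathbf{X}[n] \perp \mathbf{V}[n]$, so upon using $\hat{\mathbf{X}}[n]$ for $\hat{\mathbf{X}}[n|n-1]$,

$$
\begin{aligned}
E[\mathbf{X}[n]\tilde{\mathbf{Y}}^T[n]] &= -E[\mathbf{X}[n]\tilde{\mathbf{X}}^T[n]] \\
&= E[(\hat{\mathbf{X}}[n] - \mathbf{X}[n])\tilde{\mathbf{X}}^T[n]] \qquad \text{since } \hat{\mathbf{X}}[n] \perp \tilde{\mathbf{X}}[n] \\
&= E[\tilde{\mathbf{X}}[n]\tilde{\mathbf{X}}^T[n]] \\
&= \boldsymbol{\varepsilon}^2[n].
\end{aligned}
$$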


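Also $\tilde{\mathbf{X}}[n] \perp \mathbf{V}[n]$, so we have from Equation 11.2-18

$$
E[\tilde{\mathbf{Y}}[n]\tilde{\mathbf{Y}}^T[n]] = E[\tilde{\mathbf{X}}[n]\tilde{\mathbf{X}}^T[n]] + E[\mathbf{V}[n]\mathbf{V}^T[n]]
$$

or

$$
\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n] = \boldsymbol{\varepsilon}^2[n] + \boldsymbol{\sigma}^2_{\mathbf{V}}[n];
$$

thus by Equation 11.2-13 we have

$$
\mathbf{G}_n = \boldsymbol{\varepsilon}^2[n]\left(\boldsymbol{\varepsilon}^2[n] + \boldsymbol{\sigma}^2_{\mathbf{V}}[n]\right)^{-1}, \qquad n \ge 0. \tag{11.2-19}
$$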


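The problem is now reduced to calculating the prediction-error variance matrix $\boldsymbol{\varepsilon}^2[n]$. From
Problem 11.3 we can write

$$
\boldsymbol{\varepsilon}^2[n] = E[\mathbf{X}[n]\mathbf{X}^T[n]] - E[\hat{\mathbf{X}}[n]\hat{\mathbf{X}}^T[n]]. \tag{11.2-20}
$$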


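To evaluate the right side of Equation 11.2-20 we use Equation 11.2-6 and $\mathbf{X}[n-1] \perp \mathbf{W}[n]$
to get

$$
E[\mathbf{X}[n]\mathbf{X}^T[n]] = \mathbf{A}E[\mathbf{X}[n-1]\mathbf{X}^T[n-1]]\mathbf{A}^T + \mathbf{B}\boldsymbol{\sigma}^2_{\mathbf{W}}\mathbf{B}^T. \tag{11.2-21}
$$

Likewise using $\hat{\mathbf{X}}[n] = \mathbf{A}(\hat{\mathbf{X}}[n-1] + \mathbf{G}_{n-1}\tilde{\mathbf{Y}}[n-1])$ and $\hat{\mathbf{X}}[n-1] \perp \tilde{\mathbf{Y}}[n-1]$, we get

$$
\begin{aligned}
E[\hat{\mathbf{X}}[n]\hat{\mathbf{X}}^T[n]] &= \mathbf{A}E[\hat{\mathbf{X}}[n-1]\hat{\mathbf{X}}^T[n-1]]\mathbf{A}^T + \mathbf{A}\mathbf{G}_{n-1}\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n-1]\mathbf{G}^T_{n-1}\mathbf{A}^T \\
&= \mathbf{A}E[\hat{\mathbf{X}}[n-1]\hat{\mathbf{X}}^T[n-1]]\mathbf{A}^T + \mathbf{A}\boldsymbol{\varepsilon}^2[n-1]\mathbf{G}^T_{n-1}\mathbf{A}^T,
\end{aligned}
\tag{11.2-22}
$$

where we have used $\mathbf{G}_{n-1}\boldsymbol{\sigma}^2_{\tilde{\mathbf{Y}}}[n-1] = \boldsymbol{\varepsilon}^2[n-1]$.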


*Remember that $\hat{\mathbf{X}}[n]$ is the prediction estimate, $\hat{\mathbf{X}}[n|n-1]$.
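Substituting Equations 11.2-21 and 11.2-22 into Equation 11.2-20 and simplifying then
yields

$$
\boldsymbol{\varepsilon}^2[n] = \mathbf{A}\boldsymbol{\varepsilon}^2[n-1]\left(\mathbf{I} - \mathbf{G}^T_{n-1}\right)\mathbf{A}^T + \mathbf{B}\boldsymbol{\sigma}^2_{\mathbf{W}}\mathbf{B}^T \tag{11.2-23}
$$

for $n \ge 0$, where

$$
\boldsymbol{\varepsilon}^2[-1] \triangleq E[\tilde{\mathbf{X}}[-1]\tilde{\mathbf{X}}^T[-1]] = E[\mathbf{X}[-1]\mathbf{X}^T[-1]] =
\begin{cases}
\mathbf{0} & \text{if } \mathbf{X}[-1] = \mathbf{0} \text{ (known)}, \\
\boldsymbol{\sigma}^2_{\mathbf{X}} & \text{if } \mathbf{X} \text{ is WSS.}^{\dagger}
\end{cases}
$$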
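In summary, the Kalman filter for the state equation (Equation 11.2-6) and the observation
model (Equation 11.2-7) thus consists of the filtering equation (Equation 11.2-16), the gain
equation (Equation 11.2-19), and the prediction-error covariance equation (Equation 11.2-23).
The filtering-error covariance can be calculated as

$$
\mathbf{e}^2[n] \triangleq E[(\mathbf{X}[n] - \hat{\mathbf{X}}[n|n])(\mathbf{X}[n] - \hat{\mathbf{X}}[n|n])^T] = \boldsymbol{\varepsilon}^2[n]\left[\mathbf{I} - \mathbf{G}_n^T\right],
$$

and it is related to the prediction-error covariance by

$$
\boldsymbol{\varepsilon}^2[n] = \mathbf{A}\mathbf{e}^2[n-1]\mathbf{A}^T + \mathbf{B}\boldsymbol{\sigma}^2_{\mathbf{W}}\mathbf{B}^T, \qquad n > 0.
$$

The proof of this fact is left as an exercise to the reader.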
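
The three equations named in the summary can be put together in a few lines of code. The following Python sketch treats only the scalar case and is an illustration, not a definitive implementation: the function name `scalar_kalman_filter`, its argument names, and the simulated data (a signal with $A = 0.9$, $B = 1$, $\sigma_W^2 = 0.19$, $\sigma_V^2 = 1$, and $X[-1] = 0$) are assumptions made for the example.

```python
import random


def scalar_kalman_filter(ys, a, b, var_w, var_v, eps2_init=0.0):
    """Scalar form of Equations 11.2-16, 11.2-19, and 11.2-23.

    ys        : observations Y[0], Y[1], ...
    a, b      : scalar signal-model coefficients A and B
    var_w     : variance of W[n];  var_v : variance of V[n]
    eps2_init : eps^2[-1] (0 when X[-1] = 0 is known)
    """
    x_filt = 0.0                                   # X_hat[-1|-1] = 0
    eps2_prev = eps2_init                          # eps^2[n-1]
    gain_prev = eps2_prev / (eps2_prev + var_v)    # G_{n-1}
    estimates, gains, pred_vars = [], [], []
    for y in ys:
        # Prediction-error variance, Equation 11.2-23.
        eps2 = a * eps2_prev * (1.0 - gain_prev) * a + b * var_w * b
        # Kalman gain, Equation 11.2-19.
        gain = eps2 / (eps2 + var_v)
        # Predictor-corrector update, Equation 11.2-16.
        x_pred = a * x_filt
        x_filt = x_pred + gain * (y - x_pred)
        estimates.append(x_filt)
        gains.append(gain)
        pred_vars.append(eps2)
        eps2_prev, gain_prev = eps2, gain
    return estimates, gains, pred_vars


# Simulated Gauss-Markov signal in white Gaussian noise (assumed parameters).
rng = random.Random(0)
a, b, var_w, var_v = 0.9, 1.0, 0.19, 1.0
x, xs, ys = 0.0, [], []
for _ in range(2000):
    x = a * x + b * rng.gauss(0.0, var_w ** 0.5)
    xs.append(x)
    ys.append(x + rng.gauss(0.0, var_v ** 0.5))

estimates, gains, pred_vars = scalar_kalman_filter(ys, a, b, var_w, var_v)
mse_filter = sum((xh - xt) ** 2 for xh, xt in zip(estimates, xs)) / len(xs)
mse_raw = sum((yo - xt) ** 2 for yo, xt in zip(ys, xs)) / len(xs)
```

Note that the gain and variance sequences do not depend on the data, so in practice they could be computed once and stored; the sketch recomputes them inside the loop only to keep the three equations side by side.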


Example 11.2-2 
(scalar Kalman filter) Consider the Gauss-Markov signal model

$$
X[n] = 0.9X[n-1] + W[n], \qquad n \ge 0,
$$

with means equal to zero and $\sigma_W^2 = 0.19$. Also $X[-1] = 0$. Let the scalar observation
equation be

$$
Y[n] = X[n] + V[n], \qquad n \ge 0,
$$

with $\sigma_V^2 = 1$. We have $A = 0.9$ and $B = 1$. The Kalman filter Equation 11.2-16 then
becomes

$$
\hat{X}[n|n] = 0.9\hat{X}[n-1|n-1] + G_n\left(Y[n] - 0.9\hat{X}[n-1|n-1]\right),
$$

with initial condition $\hat{X}[-1|-1] = X[-1] = 0$. The Kalman gain Equation 11.2-19 is

$$
G_n = \varepsilon^2[n]/(1 + \varepsilon^2[n]),
$$

and the prediction-error variance Equation 11.2-23 is given as

$$
\begin{aligned}
\varepsilon^2[n] &= 0.81\,\varepsilon^2[n-1](1 - G_{n-1}) + 0.19 \\
&= 0.81\,\varepsilon^2[n-1]\left(\frac{1}{1 + \varepsilon^2[n-1]}\right) + 0.19 \\
&= \frac{0.19 + \varepsilon^2[n-1]}{1 + \varepsilon^2[n-1]}, \qquad n \ge 0,
\end{aligned}
$$


†Strictly speaking, this would require a minor modification of our development, since we have assumed $\mathbf{X}[-1] = \mathbf{0}$. However, to model the process in the wide-sense stationary sense, we can just take $\mathbf{X}[-1] = \mathbf{X}$, a Gaussian random vector independent of $\mathbf{W}[n]$ for $n \ge 0$, and with variance matrix $\boldsymbol{\sigma}^2_{\mathbf{X}}$.
with initial condition $\varepsilon^2[-1] = 0$. We can solve this equation for the steady-state solution
$\varepsilon^2[\infty]$,

$$
\varepsilon^2[\infty] = \frac{0.19 + \varepsilon^2[\infty]}{1 + \varepsilon^2[\infty]},
$$

and discarding the negative root we obtain

$$
\varepsilon^2[\infty] = 0.436.
$$


Alternatively, we can use the MATLAB program,

    >> eps2(1) = 0.3193;   % eps^2[1], obtained from eps^2[0] = 0.19
    >> for n=2:20
    eps2(n) = ( 0.19 + eps2(n-1) ) / ( 1.0 + eps2(n-1) ) ;
    end
    >> plot(eps2)


to generate the plot shown in Figure 11.2-4, where we used $\varepsilon^2[0] = 0.19$. We note that
convergence to the steady state is monotonic and rapid, essentially occurring by $n = 10$.

Either way, we see that $\varepsilon^2[n] \to 0.436$ and hence $G_n \to 0.304$, so that the Kalman filter
is asymptotically given by

$$
\begin{aligned}
\hat{X}[n|n] &= 0.9\hat{X}[n-1|n-1] + 0.304\left(Y[n] - 0.9\hat{X}[n-1|n-1]\right) \\
&= 0.626\hat{X}[n-1|n-1] + 0.304\,Y[n],
\end{aligned}
$$

and the steady-state filtering error is $e^2[\infty] = 0.304$.
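
The steady-state values can also be obtained in closed form. For a scalar model, setting $\varepsilon^2[n] = \varepsilon^2[n-1]$ in the variance recursion leads to a quadratic equation in $\varepsilon^2[\infty]$. The Python sketch below solves it; the function name `steady_state_scalar` and its argument names are hypothetical, chosen only for this illustration.

```python
import math


def steady_state_scalar(a, b, var_w, var_v):
    """Steady-state prediction variance, gain, and filtering variance (scalar case).

    Setting eps2[n] = eps2[n-1] = eps2 in the scalar recursion gives
        eps2**2 + c1 * eps2 - q * var_v = 0,
    with q = b*b*var_w and c1 = var_v * (1 - a*a) - q.
    """
    q = b * b * var_w
    c1 = var_v * (1.0 - a * a) - q
    eps2 = 0.5 * (-c1 + math.sqrt(c1 * c1 + 4.0 * q * var_v))  # positive root
    gain = eps2 / (eps2 + var_v)
    e2 = eps2 * (1.0 - gain)
    return eps2, gain, e2


eps2_inf, gain_inf, e2_inf = steady_state_scalar(0.9, 1.0, 0.19, 1.0)
pole = 0.9 * (1.0 - gain_inf)   # coefficient of X_hat[n-1|n-1] in steady state
```

For the parameters of this example the quadratic reduces to $(\varepsilon^2[\infty])^2 = 0.19$, and the computed values agree with the rounded figures quoted in the example.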


Figure 11.2-4 Prediction-error variance $\varepsilon^2[n]$ (horizontal axis: $n$ from 0 to 20).
We have developed the Kalman filter as a computationally attractive solution for estimating
a Gauss-Markov signal (Equation 11.2-6) observed in additive white Gaussian noise
(Equation 11.2-7) over the semi-infinite time interval $0 \le n < \infty$. The filter equations
are time-variant because the initial estimates near $n = 0$ effectively have a truncated past.
For our constant parameter and constant noise variance assumptions, the estimation error
should tend toward an asymptotic or steady-state value as we move away from $n = 0$, at
least for stable signal models.

The Kalman filter derivation can readily be generalized to allow time-varying parameters
$\mathbf{A}_n$ and $\mathbf{B}_n$ in Equation 11.2-6. We can also permit time-varying noise variances $\boldsymbol{\sigma}^2_{\mathbf{W}}[n]$
and $\boldsymbol{\sigma}^2_{\mathbf{V}}[n]$. In fact, the present derivation will also serve for this time-variant case by just
inserting the subscripts on $\mathbf{A}$ and $\mathbf{B}$ as required.

The observation Equation 11.2-7 is also overly restrictive. See Problem 11.10 for a
generalization to allow the observation vectors to be of different dimension than the signal
vectors,

$$
\mathbf{Y}[n] = \mathbf{C}_n\mathbf{X}[n] + \mathbf{V}[n].
$$

In this equation, the rectangular matrix $\mathbf{C}_n$ permits linear combinations of the state vector
$\mathbf{X}$ to appear in the observations and hence can model FIR convolution.

Kalman filters have seen extensive use in control systems and automatic target tracking
and recognition. Their wide popularity is largely due to the availability of small and
modest-size computers with great number-crunching power. Without such processors, Kalman filters
would have remained largely of theoretical interest. There are many books that discuss the
Kalman filter; see, for example, [11-3], [11-6], and [11-7].


11.3 WIENER FILTERS FOR RANDOM SEQUENCES 


In this section we will investigate optimal linear estimates for signals that are not necessarily
Gauss-Markov of any order. This theory predates the Kalman-Bucy filter of the
last section. This optimum linear filter is associated with Norbert Wiener and Andrei N.
Kolmogorov, who performed this work in the 1940s. The discrete-time theory that is
presented in this section is in fact the work of Kolmogorov [11-8], while Wiener developed
the continuous-time theory [11-9] for random processes. Nevertheless, it has become conventional
to refer to both types of filters as Wiener filters. These filters are mainly appropriate
for WSS random processes (sequences) observed over infinite time intervals.

We start with the general problem of finding the LMMSE estimate of the random
sequence $X[n]$ from observations of the random sequence $Y[n]$ for all time. We assume
both random sequences are WSS and for convenience are zero-mean and real-valued. Our
approach to finding the LMMSE estimate will be based on the innovations sequence. For
these infinite time-interval observations, the innovations may be obtained using spectral
factorization (cf. Section 8.4).

First we spectrally factor the psd of the observations $S_{YY}(z)$ into its causal and anticausal
factors,

$$
S_{YY}(z) = \sigma^2 B(z)B(z^{-1}), \tag{11.3-1}
$$
where $B(z)$ contains all the poles and zeros of $S_{YY}$ that are inside the unit circle in the
Z-plane. Hence $B$ and $B^{-1}$ are stable and causal. We can thus operate on $Y[n]$ with the LSI
operator $B^{-1}$ to produce the innovations random sequence $\tilde{Y}[n]$ as shown in Figure 11.3-1.

Figure 11.3-1 Whitening filter.

The psd of $\tilde{Y}[n]$ is seen to be white,

$$
S_{\tilde{Y}\tilde{Y}}(z) = \sigma^2.
$$

Thus, $\tilde{Y}[n]$ satisfies the three defining properties of the innovations sequence as listed
in Section 11.2. We can then base our estimate on $\tilde{Y}[n]$.


Unrealizable Case (Smoothing) 
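Consider an LSI operator $G$ with convolution kernel $g[k]$ that yields

$$
\hat{X}[n] = \sum_{k=-\infty}^{+\infty} g[k]\tilde{Y}[n-k]. \tag{11.3-2}
$$

We want to choose the $g[k]$ to minimize the MSE,

$$
E[(X[n] - \hat{X}[n])^2].
$$

Expanding this expression, we obtain for a real-valued system $g[n]$,

$$
\begin{aligned}
&E[X^2[n]] - 2E\left[\sum_k g[k]X[n]\tilde{Y}[n-k]\right] + E\left[\left(\sum_k g[k]\tilde{Y}[n-k]\right)^2\right] \\
&\quad = K_{XX}[0] - 2\sum_k g[k]K_{X\tilde{Y}}[k] + \sigma^2\sum_k g^2[k] \\
&\quad = K_{XX}[0] + \sum_{k=-\infty}^{+\infty}\left(\sigma g[k] - \frac{K_{X\tilde{Y}}[k]}{\sigma}\right)^2 - \frac{1}{\sigma^2}\sum_{k=-\infty}^{+\infty} K^2_{X\tilde{Y}}[k],
\end{aligned}
\tag{11.3-3}
$$

where the last line is obtained by completing the square (cf. Appendix A). Examining this
equation we see that only the middle term depends on the choice of $g[k]$. The minimum of
this term is obviously zero for the choice

$$
g[k] = \frac{1}{\sigma^2}K_{X\tilde{Y}}[k], \qquad -\infty < k < +\infty, \tag{11.3-4}
$$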
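the LMMSE then being given as

$$
\varepsilon_u^2 = K_{XX}[0] - \frac{1}{\sigma^2}\sum_{k=-\infty}^{+\infty} K^2_{X\tilde{Y}}[k]. \tag{11.3-5}
$$

This result can also be derived directly by using the $\hat{E}$ operator, as shown in Problem 11.11.
In the Z-transform domain the operator $G$, the Z-transform of the sequence $g[k]$, is
expressed as

$$
G(z) = \frac{1}{\sigma^2}S_{X\tilde{Y}}(z) = \frac{1}{\sigma^2}\,\frac{S_{XY}(z)}{B(z^{-1})}.
$$

The overall transfer function, including the whitening filter, then becomes

$$
\begin{aligned}
H_u(z) = \frac{G(z)}{B(z)} &= S_{XY}(z)/(\sigma^2 B(z)B(z^{-1})) \\
&= S_{XY}(z)/S_{YY}(z),
\end{aligned}
\tag{11.3-6}
$$

where the subscript "u" on $H_u$ denotes the unrealizable estimator. The MSE is given from
Equation 11.3-5 as

$$
\varepsilon_u^2 = \frac{1}{2\pi}\int_{-\pi}^{+\pi}\left(S_{XX}(\omega) - \frac{1}{\sigma^2}\left|S_{X\tilde{Y}}(\omega)\right|^2\right)d\omega
$$

using Parseval's theorem [11-3], and then simplifies to

$$
\varepsilon_u^2 = \frac{1}{2\pi}\int_{-\pi}^{+\pi}\left(S_{XX}(\omega) - \left|S_{XY}(\omega)\right|^2/S_{YY}(\omega)\right)d\omega.
$$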


Example 11.3-1 
(additive noise) Here we take the special case where the observations consist of a signal
plus noise $V[n]$ with $X \perp V$,

$$
Y[n] = X[n] + V[n], \qquad -\infty < n < +\infty.
$$

The psd $S_{YY}$ and cross-psd $S_{XY}$ then become

$$
\begin{aligned}
S_{YY}(z) &= S_{XX}(z) + S_{VV}(z), \\
S_{XY}(z) &= S_{XX}(z).
\end{aligned}
$$

Thus, the optimal unrealizable filter is

$$
H_u(z) = \frac{S_{XX}(z)}{S_{XX}(z) + S_{VV}(z)}.
$$

The MSE expression becomes

$$
\varepsilon_u^2 = \frac{1}{2\pi}\int_{-\pi}^{+\pi}\frac{S_{XX}(\omega)S_{VV}(\omega)}{S_{XX}(\omega) + S_{VV}(\omega)}\,d\omega.
$$

Examining the frequency response

$$
H_u(\omega) = \frac{S_{XX}(\omega)}{S_{XX}(\omega) + S_{VV}(\omega)},
$$

we see the interpretation of $H_u(\omega)$ as a frequency-domain weighting function. When the
signal-to-noise ratio (SNR) is high at a given frequency, $H_u(\omega)$ is close to 1, that is,

$$
H_u(\omega) \approx 1 \quad \text{when} \quad S_{XX}(\omega)/S_{VV}(\omega) \gg 1.
$$

Similarly, when the SNR is low at $\omega$, then reasonably enough, $H_u(\omega)$ is near zero.
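
The weighting-function interpretation is easy to examine numerically. The Python sketch below evaluates $H_u(\omega)$ and the smoothing MSE for one assumed case: a first-order signal with psd $S_{XX}(\omega) = 0.19/(1.81 - 1.8\cos\omega)$ in white noise with $S_{VV}(\omega) = 1$. These spectra, the function names, and the midpoint-rule integration are choices made for this illustration only.

```python
import math


def s_xx(w):
    """Assumed signal psd: 0.19 / |1 - 0.9 exp(-jw)|^2."""
    return 0.19 / (1.81 - 1.8 * math.cos(w))


def s_vv(w):
    """Assumed white-noise psd with unit variance."""
    return 1.0


def h_u(w):
    """Unrealizable Wiener filter frequency response for signal plus noise."""
    return s_xx(w) / (s_xx(w) + s_vv(w))


def smoothing_mse(num=4096):
    """Approximate (1/2pi) * integral over [-pi, pi] of Sxx*Svv/(Sxx+Svv)."""
    dw = 2.0 * math.pi / num
    total = 0.0
    for i in range(num):
        w = -math.pi + (i + 0.5) * dw          # midpoint of the i-th cell
        total += s_xx(w) * s_vv(w) / (s_xx(w) + s_vv(w))
    return total * dw / (2.0 * math.pi)


mse_u = smoothing_mse()
```

For these assumed spectra the filter passes low frequencies, where the SNR is high ($H_u(0) = 0.95$), and suppresses high frequencies, where it is low ($H_u(\pi) = 0.05$).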


Causal Wiener Filter 


If we add the constraint that the LMMSE estimate must be causal, we get a filter that
can be approximated more easily in the time domain. We can proceed as before up to
Equation 11.3-3, at which point we must apply the constraint that $g[k] = 0$ for $k < 0$. Thus,
the optimal solution is to set

$$
g[n] = \frac{1}{\sigma^2}K_{X\tilde{Y}}[n]\,u[n]. \tag{11.3-7}
$$

The overall error then becomes

$$
\varepsilon_c^2 = K_{XX}[0] - \frac{1}{\sigma^2}\sum_{k=0}^{\infty} K^2_{X\tilde{Y}}[k], \tag{11.3-8}
$$

which is seen to be no smaller than the unrealizable error in Equation 11.3-5.
To express the optimal filter in the transform domain, we introduce the notation

$$
[F(z)]_+ \triangleq \sum_{n=0}^{\infty} f[n]z^{-n} \tag{11.3-9}
$$

and then write the Z-transform of Equation 11.3-7 as

$$
G(z) = \frac{1}{\sigma^2}\left[S_{X\tilde{Y}}(z)\right]_+.
$$

The overall LMMSE causal filter then becomes

$$
H_c(z) = \frac{1}{\sigma^2 B(z)}\left[\frac{S_{XY}(z)}{B(z^{-1})}\right]_+, \tag{11.3-10}
$$

where the subscript c denotes a causal estimator. The causal filtering MSE is expressed as

$$
\varepsilon_c^2 = \frac{1}{2\pi}\int_{-\pi}^{+\pi}\left(S_{XX}(\omega) - H_c(\omega)S^*_{XY}(\omega)\right)d\omega.
$$


Example 11.3-2 
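(first-order signal in white noise) Let the signal psd be given as

$$
S_{XX}(z) = \frac{0.19}{(1 - 0.9z^{-1})(1 - 0.9z)}.
$$

Let the observations be given as

$$
Y[n] = X[n] + V[n],
$$

where $V[n]$ is white noise with variance $\sigma_V^2 = 1$. Then the psd of the observations is

$$
\begin{aligned}
S_{YY}(z) &= S_{XX}(z) + S_{VV}(z) \\
&= 1.436\left[\frac{1 - 0.627z^{-1}}{1 - 0.9z^{-1}}\right]\left[\frac{1 - 0.627z}{1 - 0.9z}\right] \\
&= \sigma^2 B(z)B(z^{-1}),
\end{aligned}
$$

so that

$$
\begin{aligned}
\left[\frac{S_{XY}(z)}{B(z^{-1})}\right]_+ &= \left[\frac{0.19}{(1 - 0.9z^{-1})(1 - 0.627z)}\right]_+ \\
&= \left[\frac{0.436}{1 - 0.9z^{-1}} + \frac{0.273}{z^{-1} - 0.627}\right]_+ \\
&= \frac{0.436}{1 - 0.9z^{-1}},
\end{aligned}
$$

where we have used the partial fraction expansion to recover the causal part as required by
Equation 11.3-9. Thus by Equation 11.3-10, the optimal causal filter is

$$
\begin{aligned}
H_c(z) &= \frac{1}{\sigma^2 B(z)}\,\frac{0.436}{1 - 0.9z^{-1}} \\
&= \frac{0.304}{1 - 0.627z^{-1}}.
\end{aligned}
$$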
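
The numbers in this example can be reproduced with a short calculation. The Python sketch below carries out the spectral factorization for a first-order signal in white noise by matching coefficients of $z^0$ and $z^{\pm 1}$, then forms the causal part and the filter gain. The variable names are hypothetical, and the approach (a quadratic for the zero location) is specific to this first-order case rather than a general factorization routine.

```python
import math

rho, var_w, var_v = 0.9, 0.19, 1.0      # signal pole, W variance, V variance

# Numerator of S_YY(z) over the common denominator (1 - rho/z)(1 - rho*z):
#   var_w + var_v * (1 - rho/z)(1 - rho*z) = c0 - c1 * (z + 1/z)
c0 = var_w + var_v * (1.0 + rho * rho)
c1 = var_v * rho

# Match sigma2 * (1 - a/z)(1 - a*z) = sigma2*(1 + a*a) - sigma2*a*(z + 1/z).
ratio = c0 / c1                                        # equals (1 + a*a) / a
a = 0.5 * (ratio - math.sqrt(ratio * ratio - 4.0))     # zero inside unit circle
sigma2 = c1 / a                                        # innovations variance

# Causal part of S_XY(z)/B(1/z) = var_w / ((1 - rho/z)(1 - a*z)):
# the coefficient of 1/(1 - rho/z) is found by evaluating at z = rho.
k0 = var_w / (1.0 - a * rho)

# Causal Wiener filter H_c(z) = gain / (1 - a/z), from Equation 11.3-10.
gain = k0 / sigma2

# Causal MSE from Equation 11.3-8 with K_XYtilde[k] = k0 * rho**k for k >= 0.
kxx0 = var_w / (1.0 - rho * rho)
mse_c = kxx0 - (k0 * k0 / (1.0 - rho * rho)) / sigma2
```

The resulting causal filter has the same pole and gain as the steady-state Kalman filter of Example 11.2-2, and `mse_c` matches its steady-state filtering error, as one would expect since both solve the same estimation problem once the start-up transient has died out.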


While Wiener filters do not require a signal model such as Gauss-Markov, as used in
the Kalman filter, they suffer in comparison with the latter in that their memory storage
requirements are greater and they are less conveniently adaptive to changes in system or
noise parameters.
11.4 EXPECTATION-MAXIMIZATION ALGORITHM


The expectation-maximization (E-M) algorithm is an iterative method of obtaining a
maximum-likelihood estimator (MLE) of a parameter $\theta$ or parameter vector $\boldsymbol{\theta}$ of a probability
function such as the PMF (probability mass function), pdf (probability density
function), and so forth. The MLE is obtained by forming the likelihood function $l_X(\theta) =
\prod_{i=1}^{N} f_{X_i}(X_i; \theta)$, or log-likelihood function $L_X(\theta) = \log \prod_{i=1}^{N} f_{X_i}(X_i; \theta)$, and, most often
(but not always), differentiating with respect to $\theta$ to find an estimator $\hat{\theta}$ (a random variable)
that maximizes the likelihood function. The MLE $\hat{\theta}$ depends only on the data, that
is, $\hat{\theta} = d(X_1, \ldots, X_N)$. The principle is the same if $\boldsymbol{\theta}$ is a vector parameter, that is,
$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$; in that case the MLE of $\boldsymbol{\theta}$ would involve $k$ functions $\{d_i\}$ of the data,
that is, $\hat{\theta}_1 = d_1(X_1, \ldots, X_N)$, $\hat{\theta}_2 = d_2(X_1, \ldots, X_N)$, $\ldots$, $\hat{\theta}_k = d_k(X_1, \ldots, X_N)$. However
(and this is the crux of the problem), what happens when the data $X_1, \ldots, X_N$ are not
directly observable? For example, suppose we observe not the vector of random variables
$\mathbf{X} = (X_1, \ldots, X_N)$ but, instead, $\mathbf{Y} = (Y_1, \ldots, Y_M)$, where $Y_1 = T_1(\mathbf{X}), \ldots, Y_M = T_M(\mathbf{X})$
and $M < N$. Here the functions $\{T_i\}$ are often many-to-one and describe the physical
process by which the unobserved, but so-called complete data $\mathbf{X}$ gets transformed into the
observed but incomplete data $\mathbf{Y}$. For example, in computer-aided tomography (CAT), we
measure $Y_i = \sum_j a_{ij}X_j$, $i = 1, \ldots, D$, where the $\{Y_i\}$ are the $D$ detector readings (the
observable but incomplete data), the $\{a_{ij}\}$ relate to the geometry of the configuration, and
the $\{X_j\}$ are the pixel opacities (the desired and complete data but not directly observable).
If we wish to determine the parameters governing the distribution of the $\{X_j\}$, it seems that
we must do so through the incomplete data $\{Y_i\}$. Unfortunately this is hard to do in many
cases because of the nature of the transformation from $\mathbf{X}$ to $\mathbf{Y}$. The E-M algorithm enables
us to find the MLE of the unknown parameter(s) by a series of iterations in which only
the incomplete data $\mathbf{Y}$ are involved. Each iteration consists of two steps: the expectation or
so-called E-step, and the maximization or so-called M-step. The expectation is with respect
to the underlying variables $\mathbf{X}$, using the current estimate of the unknown parameters and
conditioned upon the observations $\mathbf{Y}$. The maximization step provides a new estimate of
the unknown parameters.

There are many examples of complete and incomplete data. For example, suppose we
want to estimate the mean of a Normal random variable $X$ by making $N$ i.i.d. observations
$X_1, X_2, \ldots, X_N$ on $X$. The data, sometimes called the complete data, allows us to obtain the
MLE as $\hat{\theta} = \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} X_i$. However, suppose our data is corrupted by independent Normally
distributed noise. Then instead of the clean $X_i$, we now get the corrupted data $Y_i = X_i + N_i$,
$i = 1, \ldots, N$. Here the $\{Y_i\}$, being a many-to-one transformation on the $\{X_i\}$, represent the
incomplete data.

The E-M algorithm cannot work miracles. The MLE obtained from the {Y;}, whether 
by direct methods or by iterations as in the E-M algorithm, will generally have a higher 
variance than the MLE obtained from the complete data. We illustrate with an example. 


Example 11.4-1 
(MLE of the mean using incomplete data) We wish to obtain the MLE of the mean, $\mu$,
of a real, Normal random variable $X$ with variance $\sigma^2$ from two independent observations
on $X$, namely $X_1$ and $X_2$. The joint pdf of the data is $f_{X_1X_2}(x_1, x_2) = (2\pi\sigma^2)^{-1}
\exp\left(-\tfrac{1}{2}[(x_1 - \mu)^2 + (x_2 - \mu)^2]/\sigma^2\right)$. By differentiating with respect to $\mu$, we obtain the
MLE as $\hat{\theta}_X = \hat{\mu}_1 = \tfrac{1}{2}\sum_{i=1}^{2} X_i$. Suppose, however, that our measurement is the datum
$Y = T(\mathbf{X}) = 2X_1 + 3X_2$, which is a many-to-one linear transformation. Clearly we cannot
form the MLE $\hat{\theta}_X = \hat{\mu}_1 = \tfrac{1}{2}\sum_{i=1}^{2} X_i$ from $Y$ alone. However, it is easy to compute the pdf
of $Y$ as

$$
f_Y(y) = \frac{1}{\sqrt{26\pi\sigma^2}}\exp\left(-\frac{1}{2}\,\frac{(y - 5\mu)^2}{13\sigma^2}\right).
$$

The log-likelihood function of the random variable $Y$ is

$$
\log f_Y(Y) = -\tfrac{1}{2}\log(26\pi\sigma^2) - \tfrac{1}{2}\left(\frac{Y - 5\mu}{\sqrt{13}\,\sigma}\right)^2,
$$

and maximizing with respect to $\mu$ yields the MLE, based on $Y$, as

$$
\hat{\theta}_Y = \hat{\mu}_2 = \tfrac{1}{5}Y.
$$

While both $\hat{\theta}_X$ and $\hat{\theta}_Y$ yield unbiased estimates of $\mu$, $\mathrm{Var}[\hat{\theta}_Y] = 13\sigma^2/25$ is greater than $\mathrm{Var}[\hat{\theta}_X] = \sigma^2/2$.


In Example 11.4-1 the random variable Y is an example of incomplete data. But this implies 
that there might exist other so-called left-out data that, when combined with the incomplete 
data, can yield an MLE with the variance associated with the complete data. 

We illustrate with an example. 


Example 11.4-2 
(continuation of Example 11.4-1) Suppose we could get a second measurement in Example
11.4-1 that was functionally independent from the first, for example, say, $W = X_1 - 4X_2$.
Then we can easily obtain the joint pdf $f_{WY}(w, y)$ in terms of $f_{X_1X_2}(x_1, x_2)$ as

$$
f_{WY}(w, y; \theta) = \frac{1}{11}f_{X_1X_2}(x_1, x_2; \theta),
$$

where $x_1 = (4y + 3w)/11$, $x_2 = (y - 2w)/11$, and the factor of 11 is the required Jacobian
scaling. Here, we added the argument $\theta$ to emphasize the dependence of the pdf's on the
unknown parameter $\theta$. The MLE of $\mu$ is $\hat{\theta}_{WY} = (5Y + W)/22$, which has the same variance
as that of $\hat{\mu}_1$. We can think of $W$ as one form of the left-out data needed to make the
observations complete. For a given transformation $T$ there are many choices for $W$. Both $Y$
and $W$ are functions onto the Euclidean space $R^1$, while the vector $\mathbf{X} = (X_1, X_2)$ is a function
onto $R^2$. Thus the measurement and left-out data spaces are subspaces of the nonobservable,
but complete, data space.

The point of this example is to introduce the notion of left-out data and to show that
we can write the pdf of $\mathbf{X}$ in terms of the pdfs of $Y$ and $W$. Likewise, we can write the
likelihood function of $\mathbf{X}$ in terms of those of $Y$ and $W$. Indeed, in the more general case,
with $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{W}$ representing the complete, incomplete, and left-out data, respectively, we can
write that

$$
K' f_{\mathbf{X}}(\mathbf{x}; \theta) = f_{\mathbf{WY}}(\mathbf{w}, \mathbf{y}; \theta) = f_{\mathbf{W}|\mathbf{Y}}(\mathbf{w}|\mathbf{y}; \theta)\,f_{\mathbf{Y}}(\mathbf{y}; \theta),
$$

where $K'$ is related to the Jacobian of the transformation and is 1/11 in this example.
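
The variance claims in Examples 11.4-1 and 11.4-2 can be checked with exact rational arithmetic. In the Python sketch below each estimator is written as a pair of coefficients on $(X_1, X_2)$; for independent observations with common variance $\sigma^2$, the estimator variance is $\sigma^2$ times the sum of squared coefficients. The helper `variance_factor` and the tuple representation are conventions adopted only for this illustration.

```python
from fractions import Fraction as F


def variance_factor(coeffs):
    """Var[c1*X1 + c2*X2] / sigma^2 for independent X1, X2 with equal variance."""
    return sum(c * c for c in coeffs)


# Coefficients on (X1, X2) of the two measurements.
y = (F(2), F(3))      # Y = 2*X1 + 3*X2
w = (F(1), F(-4))     # W = X1 - 4*X2

theta_x = (F(1, 2), F(1, 2))                                   # (X1 + X2)/2
theta_y = tuple(c / 5 for c in y)                              # Y/5
theta_wy = tuple((5 * cy + cw) / 22 for cy, cw in zip(y, w))   # (5Y + W)/22

# Jacobian scaling of the map (x1, x2) -> (y, w): |det [[2, 3], [1, -4]]|.
jacobian = abs(y[0] * w[1] - y[1] * w[0])
```

The estimator built from both $Y$ and $W$ has coefficients $(1/2, 1/2)$, that is, it coincides with the complete-data MLE, which is why its variance is the same.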
The next example illustrates a situation where obtaining the MLE by the direct method
does not seem feasible and the E-M algorithm is required.


Example 11.4-3 
(determining the parameters of an image using the E-M algorithm) Consider an image
$\{F(i,j)\}_{N \times N}$ represented by a (possibly long) vector $\mathbf{F}$ with $N^2$ components. It is shown
in the research literature [11-10] that $\mathbf{F}$ can be decomposed into two parts: $\mathbf{F} = \mathbf{AF} + \mathbf{V}$,
where $\mathbf{A}$ is an autoregressive image model matrix and $\mathbf{V}$ is a zero-mean, stationary Gaussian
process with diagonal covariance $\mathbf{K}_{VV} = \sigma_V^2\mathbf{I}$. The observed image $\mathbf{G}$ is often a blurred and
noisy version of $\mathbf{F}$. It is modeled by a linear, shift-invariant operation on $\mathbf{F}$ by a matrix $\mathbf{D}$
followed by the addition of independent, additive Gaussian noise, that is, $\mathbf{G} = \mathbf{DF} + \mathbf{W}$.
The covariance matrix of $\mathbf{W}$ is assumed diagonal, that is, $\mathbf{K}_{WW} = \sigma_W^2\mathbf{I}$. The unknown
parameter vector $\boldsymbol{\theta} \triangleq \{d(m,n), a(k,l), \sigma_W^2, \sigma_V^2\}$, where the $\{d(m,n)\}$ and the $\{a(k,l)\}$ are
the coefficients of the $\mathbf{D}$ and $\mathbf{A}$ matrices, respectively, is to be determined from the observed
image, the known structure of the image and blurring models, and the a priori known forms
of the probability functions for $\mathbf{V}$ and $\mathbf{W}$. With $\mathbf{K}_{GG}$ defined as the covariance matrix of
$\mathbf{G}$ and given by

$$
\mathbf{K}_{GG} = \mathbf{D}(\mathbf{I} - \mathbf{A})^{-1}\mathbf{K}_{VV}\left((\mathbf{I} - \mathbf{A})^{-1}\right)^T\mathbf{D}^T + \mathbf{K}_{WW},
$$

the MLE of $\boldsymbol{\theta}$ is obtained from

$$
\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}}\left(-\log(\det|\mathbf{K}_{GG}|) - \mathbf{G}^T\mathbf{K}_{GG}^{-1}\mathbf{G}\right). \tag{11.4-1}
$$

Because finding the solution to Equation 11.4-1 involves a complicated nonlinear optimization
in many variables, a direct solution does not seem possible. However, the problem has
been solved by the E-M method and the solution is described in the published literature.


Log-likelihood for the Linear Transformation 


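We now extend the results of Example 11.4-2 to a general many-to-one linear transformation
and write

$$
\begin{aligned}
W_1 &= L_1(X_1, X_2, \ldots, X_N) \\
W_2 &= L_2(X_1, X_2, \ldots, X_N) \\
&\;\;\vdots \\
W_k &= L_k(X_1, X_2, \ldots, X_N) \\
Y_1 &= L_{k+1}(X_1, X_2, \ldots, X_N) \\
&\;\;\vdots \\
Y_M &= L_{k+M}(X_1, X_2, \ldots, X_N),
\end{aligned}
$$

where the $L_i$, $i = 1, \ldots, k + M = N$, are linear operators. The appropriate transformation
for the pdf's is

$$
K' f_{\mathbf{X}}(\mathbf{x}; \theta) = f_{\mathbf{WY}}(\mathbf{w}, \mathbf{y}; \theta) = f_{\mathbf{W}|\mathbf{Y}}(\mathbf{w}|\mathbf{y}; \theta)\,f_{\mathbf{Y}}(\mathbf{y}; \theta), \tag{11.4-2}
$$

where $K'$ is the Jacobian scaling of the transformation and is of little interest in what
follows. Written as a log-likelihood function, Equation 11.4-2 can be converted to

$$
\log f_{\mathbf{Y}}(\mathbf{Y}; \theta) = \log f_{\mathbf{X}}(\mathbf{X}; \theta) - \log f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta) + K, \tag{11.4-3}
$$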


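where $K \triangleq \log K'$. With the exception of the constant term, each of the terms in
Equation 11.4-3 can be interpreted as a likelihood function. If we take the conditional
expectation of Equation 11.4-3, term by term, for some value of the parameter $\theta$, say $\theta'$,
conditioned on both $\mathbf{Y}$ and the current estimator of $\theta$, say $\theta^{(k)}$, we obtain

$$
\log f_{\mathbf{Y}}(\mathbf{Y}; \theta') = E[\log f_{\mathbf{X}}(\mathbf{X}; \theta') \mid \mathbf{Y}; \theta^{(k)}] - E[\log f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta') \mid \mathbf{Y}; \theta^{(k)}] + K, \tag{11.4-4}
$$

where the left-hand side (LHS) of Equation 11.4-4 is a constant with respect to the expectation,
since the conditioning is on $\mathbf{Y}$.
To shorten the notation, define

$$
U(\theta', \theta^{(k)}) \triangleq E[\log f_{\mathbf{X}}(\mathbf{X}; \theta') \mid \mathbf{Y}; \theta^{(k)}] \tag{11.4-5}
$$

and

$$
V(\theta', \theta^{(k)}) \triangleq E[\log f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta') \mid \mathbf{Y}; \theta^{(k)}] \tag{11.4-6}
$$

so that Equation 11.4-4 can be rewritten as

$$
\log f_{\mathbf{Y}}(\mathbf{Y}; \theta') = U(\theta', \theta^{(k)}) - V(\theta', \theta^{(k)}) + K. \tag{11.4-7}
$$

Now if it can be shown that the function $V(\cdot, \cdot)$ has the property that

$$
V(\theta', \theta^{(k)}) \le V(\theta^{(k)}, \theta^{(k)}), \qquad \text{Condition 1}, \tag{11.4-8}
$$

and if $\theta'$ is chosen such that

$$
U(\theta', \theta^{(k)}) \ge U(\theta^{(k)}, \theta^{(k)}), \qquad \text{Condition 2}, \tag{11.4-9}
$$

then it follows from Equation 11.4-7 that

$$
\log f_{\mathbf{Y}}(\mathbf{Y}; \theta') \ge \log f_{\mathbf{Y}}(\mathbf{Y}; \theta^{(k)}) \tag{11.4-10}
$$

or, equivalently,

$$
f_{\mathbf{Y}}(\mathbf{Y}; \theta') \ge f_{\mathbf{Y}}(\mathbf{Y}; \theta^{(k)}). \tag{11.4-11}
$$

Equations 11.4-8, 11.4-9, and 11.4-11 are the basis for the E-M algorithm. For Equation
11.4-10 to be true, we must first show that Equation 11.4-8 is true. To this end define

$$
Z \triangleq f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta')/f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta^{(k)}) \tag{11.4-12}
$$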
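and use the fact that $\log Z \le Z - 1$. Since this is true for every realization of $Z$, it must
also be true for the expectations, conditional or otherwise. Thus,

$$
E\left[\log\frac{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta')}{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta^{(k)})} \,\Big|\, \mathbf{Y}; \theta^{(k)}\right]
\le E\left[\frac{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta')}{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta^{(k)})} \,\Big|\, \mathbf{Y}; \theta^{(k)}\right] - 1. \tag{11.4-13}
$$

But, by definition of the expectation, the right-hand side (RHS) is merely

$$
\int_{-\infty}^{\infty}\frac{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{w}|\mathbf{y}; \theta')}{f_{\mathbf{W}|\mathbf{Y}}(\mathbf{w}|\mathbf{y}; \theta^{(k)})}\,f_{\mathbf{W}|\mathbf{Y}}(\mathbf{w}|\mathbf{y}; \theta^{(k)})\,d\mathbf{w} - 1 = 0,
$$

from which it follows that

$$
E\left[\log\left[f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta')/f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta^{(k)})\right] \mid \mathbf{Y}; \theta^{(k)}\right] \le 0 \tag{11.4-14}
$$

or, equivalently,

$$
\begin{aligned}
V(\theta', \theta^{(k)}) &\triangleq E[\log f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta') \mid \mathbf{Y}; \theta^{(k)}] \\
&\le E[\log f_{\mathbf{W}|\mathbf{Y}}(\mathbf{W}|\mathbf{Y}; \theta^{(k)}) \mid \mathbf{Y}; \theta^{(k)}] \triangleq V(\theta^{(k)}, \theta^{(k)}).
\end{aligned}
\tag{11.4-15}
$$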
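
The inequality behind Condition 1 can be checked numerically in a discrete setting. In the Python sketch below, the pmfs `p` and `q` are arbitrary stand-ins for $f_{\mathbf{W}|\mathbf{Y}}(\cdot|\mathbf{y}; \theta^{(k)})$ and $f_{\mathbf{W}|\mathbf{Y}}(\cdot|\mathbf{y}; \theta')$ on a three-point alphabet; they are assumptions made for the illustration and do not come from any particular model.

```python
import math


def expected_log(q, p):
    """E_p[log q(W)]: expectation under pmf p of the log of pmf q."""
    return sum(p_w * math.log(q_w) for p_w, q_w in zip(p, q))


p = [0.5, 0.3, 0.2]   # stand-in for f_{W|Y}(w|y; theta^(k))
q = [0.2, 0.2, 0.6]   # stand-in for f_{W|Y}(w|y; theta')

v_new = expected_log(q, p)    # plays the role of V(theta', theta^(k))
v_old = expected_log(p, p)    # plays the role of V(theta^(k), theta^(k))

# Right-hand side of Equation 11.4-13: E_p[q/p] - 1, which is sum(q) - 1 = 0.
rhs = sum(p_w * (q_w / p_w) for p_w, q_w in zip(p, q)) - 1.0
lhs = v_new - v_old           # left-hand side of Equation 11.4-13
```

Any other choice of `q` gives `v_new <= v_old` as well, with equality only when `q` equals `p`.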

Thus, Condition 1 is met. To meet Condition 2, it is merely required to compute

$$
U(\theta', \theta^{(k)}) \triangleq E[\log f_{\mathbf{X}}(\mathbf{X}; \theta') \mid \mathbf{Y}; \theta^{(k)}]
$$

and update the estimate $\theta^{(k)}$ by finding $\theta^{(k+1)}$ as

$$
\theta^{(k+1)} = \arg\max_{\theta'} U(\theta', \theta^{(k)}). \tag{11.4-16}
$$

The operation $\theta^{(k+1)} = \arg\max_{\theta'} U(\theta', \theta^{(k)})$ is short for "the value of $\theta'$ that maximizes
$U(\theta', \theta^{(k)})$ and is called $\theta^{(k+1)}$."
The net effect of these operations is to achieve the goal of Equation 11.4-11, namely,

$$
f_{\mathbf{Y}}(\mathbf{Y}; \theta^{(k+1)}) \ge f_{\mathbf{Y}}(\mathbf{Y}; \theta^{(k)}).
$$

The E-M algorithm is not guaranteed to converge to the MLE of $\theta$. The algorithm, which
we summarize below, can stagnate at a local maximum of the likelihood. Nevertheless the E-M algorithm has
been used with good success in tomography, speech recognition, active noise cancellation,
spread-spectrum communications, and still others [11-11].


Summary of the E-M algorithm 


Start with an arbitrary initial estimate $\theta^{(0)}$ of the unknown parameter $\theta$. Here we write $\theta$
as a scalar but it could equally be a vector. For $k = 0, 1, 2, \ldots$ compute
1. The E-step: compute $U(\theta', \theta^{(k)}) \triangleq E[\log f_{\mathbf{X}}(\mathbf{X}; \theta') \mid \mathbf{Y}; \theta^{(k)}]$ as a function of $\theta'$.
(This is often the hard part.)
2. The M-step: compute $\theta^{(k+1)} \triangleq \arg\max_{\theta'} U(\theta', \theta^{(k)})$.
Stop when $\theta^{(k)}$ stops changing, or is changing so slowly that an imposed convergence criterion
is met.
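
The two steps can be made concrete with the noisy-mean problem mentioned earlier in this section. The Python sketch below assumes $X_i \sim N(\mu, \sigma_X^2)$ observed as $Y_i = X_i + N_i$ with independent $N_i \sim N(0, \sigma_N^2)$ and both variances known; the unknown parameter is $\mu$. Under these assumptions the E-step reduces to $E[X_i \mid Y_i; \mu^{(k)}] = \mu^{(k)} + g\,(Y_i - \mu^{(k)})$ with $g = \sigma_X^2/(\sigma_X^2 + \sigma_N^2)$, and the M-step is the average of these conditional means. The function names and the simulated data are hypothetical.

```python
import random


def em_mean(ys, var_x, var_n, mu0=0.0, tol=1e-12, max_iter=1000):
    """E-M iteration for the mean of X when only Y = X + noise is observed."""
    g = var_x / (var_x + var_n)
    mu = mu0
    history = [mu]
    for _ in range(max_iter):
        # E-step: conditional means of the unobserved X_i given Y_i and mu.
        x_hat = [mu + g * (y - mu) for y in ys]
        # M-step: U(mu', mu) is maximized by the average of those means.
        mu_new = sum(x_hat) / len(x_hat)
        history.append(mu_new)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu, history


def log_likelihood(ys, mu, var_x, var_n):
    """Incomplete-data log-likelihood of mu, up to an additive constant."""
    return -sum((y - mu) ** 2 for y in ys) / (2.0 * (var_x + var_n))


rng = random.Random(1)
var_x, var_n = 1.0, 1.0
ys = [rng.gauss(3.0, (var_x + var_n) ** 0.5) for _ in range(50)]
mu_hat, history = em_mean(ys, var_x, var_n)
logliks = [log_likelihood(ys, m, var_x, var_n) for m in history]
```

In this simple case the direct MLE from the incomplete data is the sample mean of the $Y_i$, so the iteration is not needed in practice; the point is only that the iterates approach that value while the likelihood never decreases.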
E-M Algorithm for Exponential Probability Functions


A simplification of the E-step occurs when the pdf or PMF of the data belongs to the
exponential family of probability functions. Such functions can be written in the somewhat
general form

$$
f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = b(\mathbf{x})\,c(\boldsymbol{\theta})\exp[\mathbf{t}(\mathbf{x})\boldsymbol{\Gamma}^T(\boldsymbol{\theta})], \tag{11.4-17}
$$

where $b(\mathbf{x})$ is a function that depends only on $\mathbf{x}$, $c(\boldsymbol{\theta})$ is a constant with respect to $\mathbf{x}$, $\boldsymbol{\Gamma}(\boldsymbol{\theta})$ is a (possibly
vector) function of the unknown parameter $\boldsymbol{\theta}$, and $\mathbf{t}(\mathbf{x})$ is a (possibly row vector) function
that depends only on the realizations and is independent of the parameter $\boldsymbol{\theta}$. When used in
a likelihood function, $\mathbf{t}(\mathbf{X})$ is called a sufficient statistic for $\boldsymbol{\theta}$ because, for the exponential
family of pdf's (or PMFs), it aggregates the data in a fashion that is sufficient to form an
estimator for $\boldsymbol{\theta}$. Being essentially an estimator, a sufficient statistic cannot depend on $\boldsymbol{\theta}$. A
key requirement on a sufficient statistic is that the pdf (or PMF) of the data $\mathbf{X}$, conditioned
on the sufficient statistic, must not depend on $\boldsymbol{\theta}$.


Example 11.4-4 
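Consider the joint PMF of $N$ independent Poisson random variables $X_1, \ldots, X_N$ with
Poisson parameters $\theta_1, \ldots, \theta_N$, respectively. The joint PMF can be written as

$$
P_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \prod_{i=1}^{N} e^{-\theta_i}\,\frac{\theta_i^{x_i}}{x_i!}, \tag{11.4-18}
$$

where $\mathbf{X} \triangleq (X_1, \ldots, X_N)$, $\mathbf{x} \triangleq (x_1, \ldots, x_N)$, and $\boldsymbol{\theta} \triangleq (\theta_1, \ldots, \theta_N)$. Using the fact that
$\theta_i^{x_i} = e^{x_i\log\theta_i}$, we can rewrite Equation 11.4-18 as

$$
P_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \left(\prod_{i=1}^{N}\frac{1}{x_i!}\right) \times \exp\left(-\sum_{i=1}^{N}\theta_i\right) \times \exp\left(\mathbf{x}\boldsymbol{\Gamma}^T(\boldsymbol{\theta})\right). \tag{11.4-19}
$$

Thus we associate $b(\mathbf{x})$ with $\prod_{i=1}^{N} 1/x_i!$, $c(\boldsymbol{\theta})$ with $\exp\left(-\sum_{i=1}^{N}\theta_i\right)$, $\mathbf{t}(\mathbf{x})$ with $\mathbf{x}$, and $\boldsymbol{\Gamma}(\boldsymbol{\theta})$
with the (row) vector $(\log\theta_1, \ldots, \log\theta_N)$.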
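
A quick numerical check confirms that the product form and the exponential-family form of the Poisson PMF agree. The Python sketch below evaluates both for one arbitrary choice of counts and parameters; the function names and the test values are made up for the illustration.

```python
import math


def poisson_joint_pmf(x, theta):
    """Product form of Equation 11.4-18."""
    p = 1.0
    for x_i, t_i in zip(x, theta):
        p *= math.exp(-t_i) * t_i ** x_i / math.factorial(x_i)
    return p


def exponential_family_form(x, theta):
    """Form b(x) * c(theta) * exp(t(x) Gamma^T(theta)) of Equation 11.4-19."""
    b = 1.0
    for x_i in x:
        b /= math.factorial(x_i)                  # b(x) = prod 1/x_i!
    c = math.exp(-sum(theta))                     # c(theta)
    gamma = [math.log(t_i) for t_i in theta]      # Gamma(theta)
    inner = sum(x_i * g_i for x_i, g_i in zip(x, gamma))   # t(x) = x
    return b * c * math.exp(inner)


x_example = (2, 0, 5)
theta_example = (1.5, 0.7, 3.2)
p_direct = poisson_joint_pmf(x_example, theta_example)
p_expfam = exponential_family_form(x_example, theta_example)
```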


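Still restricting ourselves to the exponential family of probability functions, we consider now
the log-likelihood function associated with estimating the vector parameter $\boldsymbol{\theta}$. This yields

$$
L_{\mathbf{X}}(\boldsymbol{\theta}) = \log b(\mathbf{X}) + \log c(\boldsymbol{\theta}) + \boldsymbol{\Gamma}(\boldsymbol{\theta})\mathbf{t}^T(\mathbf{X}). \tag{11.4-20}
$$

The expression for the E-step is then

$$
U(\boldsymbol{\theta}', \boldsymbol{\theta}^{(k)}) = E[\log b(\mathbf{X}) \mid \mathbf{Y}; \boldsymbol{\theta}^{(k)}] + \log c(\boldsymbol{\theta}') + \boldsymbol{\Gamma}(\boldsymbol{\theta}')E[\mathbf{t}^T(\mathbf{X}) \mid \mathbf{Y}; \boldsymbol{\theta}^{(k)}].
$$

However, recall that in the M-step, any expression not containing $\boldsymbol{\theta}'$ becomes irrelevant in
the maximization routine. Hence we may drop the term $E[\log b(\mathbf{X}) \mid \mathbf{Y}; \boldsymbol{\theta}^{(k)}]$. Therefore for
the family of exponential pdf's or PMF's, the E-M algorithm takes the somewhat simpler
form: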
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Start with an arbitrary initial estimate 6) of the unknown parameter @. For k = 
0,1,2,... compute: 


1. The E-step: 
FD = Ele Y; 0M); 


2. The M-step: 
Ot) = are max|log (0) + tt DPT 9"), 


Repeat until the convergence criterion is met. 
For the important Poisson case the E-M algorithm takes a special form.


1. The E-step:
$$X^{(k+1)} = E[X\,|\,Y;\theta^{(k)}];$$


2. The M-step:
$$\theta^{(k+1)} = \arg\max_{\theta}\left[-\sum_{i=1}^{N}\theta_i + X^{(k+1)}\,(\log\theta_1, \log\theta_2, \ldots, \log\theta_N)^T\right].$$

Repeat until the convergence criterion is met.
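As a brief worked step (a sketch added for illustration, assuming every component of $X^{(k+1)}$ is positive), the Poisson M-step can be solved in closed form by setting the partial derivative of the bracketed objective with respect to each $\theta_i$ to zero:

```latex
\frac{\partial}{\partial\theta_i}
\left[-\sum_{j=1}^{N}\theta_j + \sum_{j=1}^{N} X_j^{(k+1)}\log\theta_j\right]
= -1 + \frac{X_i^{(k+1)}}{\theta_i} = 0
\quad\Longrightarrow\quad
\theta_i^{(k+1)} = X_i^{(k+1)}, \qquad i = 1,\ldots,N.
```

The second derivative, $-X_i^{(k+1)}/\theta_i^2$, is negative, so this stationary point is a maximum.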


Application to Emission Tomography 


Emission tomography (ET) is a medical imaging technique in which body tissue is stimulated 
to emit photons. Typically, a radioactive positron-emitting substance is attached to glucose 
and injected into the body of the patient. Areas of the body where the glucose is rapidly 
metabolized show up as “hot spots,” that is, regions of strong photon emissions. For 
example, in metastatic cancer, the tumors, because of rapid cell division and growth, exhibit 
above-average metabolic activity and therefore emit strong streams of photons. Another 
application of ET is in imaging the areas in the brain exhibiting the metabolism of glucose 
while the patient is engaged in various activities such as reading, playing chess, watching a 
movie, and so on. In this way, researchers can determine which part of the brain is involved 
while doing a particular activity. (See for example, the article by T.K. Moon [11-11].) 
Detectors located around the body collect the photons, and the number of photons collected at each detector constitutes the data $Y$ (Figure 11.4-1). The fundamental problem in ET is, then, to reconstruct an image of the spatial distribution of the photon emission vector $\Lambda$ from the data $Y$.


Figure 11.4-1 Emission tomography configuration. (The object being imaged is surrounded by detectors.)


Figure 11.4-2 In ET, the data consist of the sum of the aggregate outputs of the individual cells. In this illustrative example, the summations are along the rows and the columns.

It should be noted that the detector data is often incomplete to begin with. We illustrate 
with an example. 


Example 11.4-5 
A greatly simplified tomographic configuration is shown in Figure 11.4-2. Each cell emits photons in a Poisson mode. The photons emitted during a certain interval of time are represented by the components of the vector $X = (X_1, X_2, X_3, X_4)^T$, which denote the number of photons emitted from cells 1, 2, 3, 4, respectively.

The vector $X$ can be said to constitute the complete data. The MLE of the photon emission vector $\Lambda = (\lambda_1, \lambda_2, \lambda_3, \lambda_4)^T$ is readily shown to be $\hat{\lambda}_i = X_i$, $i = 1,\ldots,4$. However, the data collected at the four detectors are represented by

$$Y = (X_1 + X_3,\; X_2 + X_4,\; X_3 + X_4,\; X_1 + X_2)^T,$$

which is incomplete. For example, exactly the same data $Y$ would be collected if the complete data were given instead by $X' = (X_1 + \delta, X_2 - \delta, X_3 - \delta, X_4 + \delta)^T$. Hence, there is a many-to-one transformation in going from $X$ to $Y$.
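The many-to-one property in Example 11.4-5 can be illustrated with a few lines of code. This is a sketch only; the particular counts in `x` and the shift `delta` are assumed values, not data from the text.

```python
def detector_readings(x):
    """Row and column sums of the four-cell layout in Example 11.4-5."""
    x1, x2, x3, x4 = x
    return (x1 + x3, x2 + x4, x3 + x4, x1 + x2)

x = (5, 3, 4, 6)   # assumed complete data (photon counts per cell)
delta = 2          # assumed integer shift that keeps every count nonnegative
x_alt = (x[0] + delta, x[1] - delta, x[2] - delta, x[3] + delta)

y = detector_readings(x)
y_alt = detector_readings(x_alt)
```

Here `x` and `x_alt` differ, yet `y` and `y_alt` coincide, so the detector data alone cannot distinguish the two sets of complete data.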


Following the example in the previously cited article by Moon, we assume that it is desired to estimate $\Lambda$, the Poisson spatial emission function. The object to be imaged consists of $B$ cells and the Poisson law governs the emission of photons from each cell. The components of $\Lambda$ are the Poisson emission parameters of the cells, that is, $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_B)$. Thus, during $T$ seconds, the probability of $n$ photon emissions from cell $b$ is $P[X_b = n\,|\,\lambda_b] = \exp(-\lambda_b T)\,(\lambda_b T)^n/n!$, where $X_b$ is the number of emitted photons from cell $b$, and $\lambda_b$ is the Poisson rate parameter of cell $b$. Without loss of generality, we take $T = 1$. Let $p_{bd}$, $b = 1,\ldots,B$; $d = 1,\ldots,D$ denote the probability that a photon from cell $b$ is detected by detector tube $d$. The $\{p_{bd}\}$ can be determined from the geometry of the sensors with respect to the body. It is assumed that no photons get lost: $\sum_{d=1}^{D} p_{bd} = 1$, that is, all the photons emitted by cell $b$ are captured by the $D$ detectors. The joint probability of the number of photons captured at each of the $D$ detectors obeys the multinomial law if we condition this probability on the total number of emitted photons. Without such conditioning, the number of photons, $Y_d$, collected at detector $d$ is Poisson (we leave the demonstration of this result to the reader) with PMF

$$P[Y_d = y] = e^{-\lambda_d}(\lambda_d)^{y}/y!. \tag{11.4-21}$$

Let $X_{bd}$ denote the number of emissions from cell $b$ detected by detector $d$. The set $X = \{X_{bd},\ b = 1,\ldots,B;\ d = 1,\ldots,D\}$ represents the complete but unobserved data. The incomplete data are the detector readings $\{Y_d:\ d = 1,\ldots,D\}$. The many-to-one map is implied by the system of equations:

$$\sum_{b=1}^{B} X_{bd} = Y_d, \tag{11.4-22}$$

where we assume that $B > D$.
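The expected value of $Y_d$ is given by

$$E[Y_d] = \lambda_d \tag{11.4-23}$$

$$= \sum_{b=1}^{B} E_{X_b}\big[E[X_{bd}\,|\,X_b]\big] = \sum_{b=1}^{B} E_{X_b}[p_{bd}X_b] = \sum_{b=1}^{B} \lambda_b p_{bd}. \tag{11.4-24}$$

Hence, $\lambda_d = \sum_{b=1}^{B} \lambda_b p_{bd}$ and each random variable $X_{bd}$ is Poisson with $\lambda_{bd} = \lambda_b p_{bd}$. We now demonstrate the use of the E-M algorithm to estimate $\Lambda = \{\lambda_1,\ldots,\lambda_B\}$.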


Log-likelihood Function of Complete Data 


A basic assumption is that each cell emits independently of any other cell and each detector operates independently of any other detector. Under this assumption the likelihood function $l_X(\Lambda)$ is given by

$$l_X(\Lambda) = \prod_{b,d} P_X(X_{bd}) = \prod_{b,d} e^{-\lambda_b p_{bd}}\,(\lambda_b p_{bd})^{X_{bd}}/X_{bd}!, \tag{11.4-25}$$

where, for convenience, we omit the explicit dependence of $l_X(\Lambda)$ on the $\{X_{bd}\}$.
As usual, we prefer to deal with the log-likelihood function, $L_X(\Lambda)$, obtained by taking the natural logarithm of $l_X(\Lambda)$. This yields

$$L_X(\Lambda) = \sum_{b,d}\left(-\lambda_b p_{bd} + X_{bd}\log\lambda_b + X_{bd}\log p_{bd} - \log X_{bd}!\right). \tag{11.4-26}$$

We wish to find the vector $\Lambda$ that maximizes this expression. The E-M algorithm will be used to estimate the unobserved data using the current best estimate, $\Lambda^{(k)}$, of $\Lambda$. Then the estimated unobserved data $X^{(k+1)}$ will be used to improve the estimate of $\Lambda$ from $\Lambda^{(k)}$ to $\Lambda^{(k+1)}$. The procedure will be repeated until the convergence criteria established by the user are met.


E-step 


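Assume that we are at the $k$th step and our current estimate of $\Lambda$ is $\Lambda^{(k)}$. We compute the E-step as

$$X_{bd}^{(k+1)} = E[X_{bd}\,|\,Y,\Lambda^{(k)}] = E[X_{bd}\,|\,Y_d,\Lambda^{(k)}]$$

(independent photon aggregate at the detectors)

$$= \sum_{x_{bd}} x_{bd}\,P\big[X_{bd}^{(k)} = x_{bd}\,\big|\,Y_d = y_d, \Lambda^{(k)}\big]$$

(given that $Y_d = \sum_{i=1}^{B} X_{id}$)

$$= \sum_{x_{bd}} x_{bd}\,\frac{P\big[X_{bd}^{(k)} = x_{bd},\ \sum_{i\neq b} X_{id}^{(k)} = y_d - x_{bd}\big]}{P[Y_d = y_d]}$$

(using Bayes' rule, rewriting the second of the joint events, and suppressing the condition on $\Lambda^{(k)}$ to save space)

$$= \sum_{x_{bd}} x_{bd}\,\frac{P\big[X_{bd}^{(k)} = x_{bd}\big]\,P\big[\sum_{i\neq b} X_{id}^{(k)} = y_d - x_{bd}\big]}{P[Y_d = y_d]} \tag{11.4-27}$$

(using the independence of the $X_{id}^{(k)}$).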
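Now use

$$P\big[X_{bd}^{(k)} = x_{bd}\big] = \exp\big(-\lambda_{bd}^{(k)}\big)\,\big(\lambda_{bd}^{(k)}\big)^{x_{bd}}/x_{bd}!, \tag{11.4-28a}$$

$$P\Big[\sum_{i\neq b} X_{id}^{(k)} = y_d - x_{bd}\Big] = \exp\Big(-\sum_{i\neq b}\lambda_{id}^{(k)}\Big)\Big(\sum_{i\neq b}\lambda_{id}^{(k)}\Big)^{y_d - x_{bd}}\Big/(y_d - x_{bd})!, \tag{11.4-28b}$$

$$P[Y_d = y_d] = \exp\Big(-\sum_{i=1}^{B}\lambda_{id}^{(k)}\Big)\Big(\sum_{i=1}^{B}\lambda_{id}^{(k)}\Big)^{y_d}\Big/y_d!,$$

$$\text{and}\quad \lambda_{id}^{(k)} = \lambda_i^{(k)} \times p_{id}. \tag{11.4-28c}$$

In Equation 11.4-28b we used the result that the sum of independent Poisson random variables is Poisson with the parameter being the sum of the Poisson parameters of the random variables.

After substituting Equations 11.4-28a,b,c into Equation 11.4-27 and performing some elementary algebraic manipulations, we obtain

$$X_{bd}^{(k+1)} = \frac{Y_d\,\lambda_b^{(k)}\,p_{bd}}{\sum_{i=1}^{B}\lambda_i^{(k)}\,p_{id}}, \tag{11.4-28d}$$

which can be seen to be a generalization of Equation 4.2-21.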

M-step


The M-step is usually easier to realize computationally than the E-step. We merely maximize the log-likelihood function with respect to $\lambda_b$ using the updated estimate of the complete data $X^{(k+1)}$.

Thus, set

$$0 = \frac{\partial}{\partial\lambda_b} L_{X^{(k+1)}}(\Lambda),$$

which yields the desired result as

$$\lambda_b^{(k+1)} = \sum_{d=1}^{D} X_{bd}^{(k+1)}. \tag{11.4-29}$$

In obtaining Equation 11.4-29 we used the fact that $\sum_{d=1}^{D} p_{bd} = 1$.
Finally, Equation 11.4-28d can be combined with Equation 11.4-29 to yield the single update equation:

$$\lambda_b^{(k+1)} = \lambda_b^{(k)} \sum_{d=1}^{D} \frac{Y_d\,p_{bd}}{\sum_{i=1}^{B}\lambda_i^{(k)}\,p_{id}}. \tag{11.4-30}$$


We then repeat the steps of the EM algorithm until the convergence criterion has been met. 

Clearly the EM algorithm is non-trivial to execute. Without computers its application to 
all but the simplest problems would not be feasible. It has become a powerful technique for 
MLE-type problems in engineering and medicine. For further reading on the EM algorithm, 
see [11-12, 11-13].


11.5 HIDDEN MARKOV MODELS (HMM)


In ordinary Markov models the underlying stochastic mechanism is the transition between 
states. The sequences of states, as time progresses, are the observations. Each state corre- 
sponds to a physical event. The evolution of the states as a function of time (or displace- 
ment, or volume) forms a Markov random sequence. The observer knows the Markov model 
itself. 

For a class of engineering problems of growing importance such as speech and image 
processing by computer, the known model assumption is too restrictive. For this reason 
certain classes of Markov models that exhibit a second degree of randomness have been 
introduced. These models are called hidden Markov models (HMM’s) because the data avail- 
able to the observer are not the evolution of the states but a second stochastic process that is 
a probabilistic function of the states. Thus it is not known which sequence of states produced 
the observed data. Indeed there are many state sequences that could have produced the same 
observations. An important question in this regard is, given a vector of observations, which 
state sequence was most likely to have produced it? 

Excellent tutorial articles have been written on HMM’s. In particular the tutorial arti- 
cles by Lawrence Rabiner and his colleagues [11-14, 11-15] have facilitated the understanding 
of HMM’s by the nonspecialist in statistics. In what follows, we shall closely follow the 
discussion and style of these tutorial articles. 

In certain types of problems, not only is it unknown which sequence of states produced a given observation sequence, but it is also not known which of several competing Markov models was the most likely generator of the given observations. We illustrate with an example adapted from [11-15].


Example 11.5-1 
(coin tossing experiment) We are given the results of a coin tossing experiment in which each observation is either a head {H} or a tail {T}. We do not know which of two models is responsible for the observation sequence O = {HTT}. The two models, $M_1$, $M_2$, are (1) two biased coins with a stochastic mechanism for choosing between them for the next coin flip; or (2) three biased coins with a stochastic mechanism for choosing among them for the next flip. We use the notation $P^i[H|C_j]$ to mean the probability of getting a head in model $i$ using coin $C_j$ in that model (note that $j = 1,2$ in the first model and $j = 1,2,3$ in the second model). We let the choice of coins be the state. Suppose we know or estimate the following probabilities:

$$P^1[H|C_1] = 0.3,\quad P^1[T|C_1] = 0.7,\quad P^1[H|C_2] = 0.6,\quad P^1[T|C_2] = 0.4,$$

and

$$P^2[H|C_1] = 0.5,\quad P^2[T|C_1] = 0.5,\quad P^2[H|C_2] = 0.8,\quad P^2[T|C_2] = 0.2,\quad P^2[H|C_3] = 0.4,\quad P^2[T|C_3] = 0.6.$$


These probabilities are known as the state conditional output symbol probabilities. The state $S[n]$ at time $n$ is the particular coin in use at time $n$. The event $\{S[n] = C_1, M_2\}$ means that the state in model two at time $n$ is $C_1$. We next need to specify the stochastic mechanism by which coins are selected, that is, the state transitions. Specifying a set of state transition probabilities does this. We use the notation $a^1_{ij}[k] = P[S[k] = C_j\,|\,S[k-1] = C_i, M_1]$ to denote these state transition probabilities, conditioned on model one, and likewise for model two: $a^2_{ij}[k] = P[S[k] = C_j\,|\,S[k-1] = C_i, M_2]$. In the models we consider, the state transition probabilities are time-invariant. That is, the probability of reaching one state from another remains the same whatever the time might be. For the sake of illustration we assign values to the state transition probabilities as in Figure 11.5-1.


[Figure 11.5-1: state-transition diagrams of the two models. Model one has two states, with $a^1_{11} = 0.6$, $a^1_{12} = 1 - a^1_{11} = 0.4$, and $a^1_{21} = 1 - a^1_{22} = 0.3$. Model two has three states. Each state is annotated with its state conditional output probabilities as listed above.]


Figure 11.5-1 Two possible models that might have produced the observation sequence O = {HTT}. In each of the two models there are obviously many state sequences that could have produced this observation vector. A basic question is, which of the two models was most likely to have produced this sequence? Another basic question is, how do we efficiently compute the probability that a model produced the observation vector?

To finish the description of these models we need to specify the initial state probabilities, that is, the probabilities of being in the various states at time $n = 1$. We denote these probabilities by $p^1_j[1]$ or $p^2_j[1]$, $j = 1,2$ for model one and $j = 1,2,3$ for model two. It is convenient to define a set of initial state probability vectors $p^1[1] = (p^1_1[1], p^1_2[1])$ for model one and $p^2[1] = (p^2_1[1], p^2_2[1], p^2_3[1])$ for model two. For the sake of specificity we assign $p^1[1] = (0.7\ \ 0.3)$ and $p^2[1] = (0.8\ \ 0.1\ \ 0.1)$. If the initial state probability vector has zeros everywhere except a one in position $j$, then the initial state must be coin $C_j$. The two models are shown in Figure 11.5-1.


HMM’s have considerable versatility in describing and recognizing complex random 
phenomena such as natural speech. Indeed HMM’s are used in natural speech recogni- 
tion by computer. An HMM is typically a doubly stochastic construct and its complexity 
and versatility depend on the number of states and whether the states are massively or 
sparsely interconnected. 


Specification of an HMM 


We limit ourselves to HMM’s where output symbols come from a discrete alphabet with 
a finite number of symbols. To fully specify an HMM we need to specify the following 
parameters: 


(1) The number of states N. In speech processing, each state might represent a different 
position of the vocal organs. Then the underlying stochastic mechanism of moving 
between states is associated with the dynamics of the vocal organs as a spoken 
word is being produced. The inertia of the mass of the vocal organs as well as 
the limited speed of nerve impulses constrains which states can be reached from 
other states. Also if the number of different positions of the vocal organs is very 
large, then the number of states will be very large and, indeed, might be too large 
for computational purposes. If the number of states has to be kept small, again 
for computational reasons, then the one-to-one relation between states and the 
position of the vocal organs becomes blurred. This does not mean, however, that 
the HMM cannot be used effectively. All states must have realizations in the vector $Q = (q_1,\ldots,q_N)$. Thus the statement that at time $n$ the HMM is in state $q_i$ is written as $\{S[n] = q_i\}$.

(2) The $N \times N$ state-transition probability matrix $A = \{a_{ij}\}$. In more complex models the elements $a_{ij}[n] \triangleq P[S[n] = q_j\,|\,S[n-1] = q_i]$ might depend on the time index $n$. Here we assume that the state transition probabilities do not depend on time. Therefore we write $a_{ij}$ and not $a_{ij}[n]$. Note that if $a_{ij} = 0$, it is not possible to go to state $q_j$ from state $q_i$ in one step. If $a_{ij} > 0$ for all $i, j$, then every state can be reached from every other state in one step. Note that the row probabilities in $A$ must add to unity, that is, $\sum_{j=1}^{N} a_{ij} = 1$, $i = 1,2,\ldots,N$, since the HMM must be in one of the $N$ states.

(3) The set of discrete observation symbols $V = \{v_1, v_2, \ldots, v_M\}$. The number $M$ is called the discrete alphabet size. It is elements of the set $V$ that are actually observed, not the states which produced them. In the case of speech or music, the actual physical output is typically considered a continuous-time, continuous-amplitude process. This means that to model such a process with a discrete-time, finite set of symbols, sampling and quantizing are required.

(4) The state conditional output probabilities. These are the probabilities $b_{jk}$ of observing the output symbol $v_k$ while in state $q_j$. By definition $b_{jk} = P[X[n] = v_k\,|\,S[n] = q_j]$ for $k = 1,\ldots,M$ and $j = 1,\ldots,N$. The random variable $X[n]$ is the observation at time $n$. In a more sophisticated model these probabilities could depend on the time index $n$, but here we assume time independence, hence we write $b_{jk}$ and not $b_{jk}[n]$. The state conditional output probability matrix $B = [b_{jk}]$ is $N \times M$.

(5) The initial state probability vector. This is the vector $p[1] = (p_1[1], p_2[1], \ldots, p_N[1])$ whose components $p_j[1]$ are the probabilities of starting the observation sequence in state $q_j$, $j = 1,2,\ldots,N$. As in Example 11.5-1, if for some $j$, $p_j[1] = 1$, the sequence must begin in state $q_j$.


For convenience the HMM can be defined by the six-tuple

$$M = (N, M, V, A, B, p[1])$$

although it is customary in the applications literature [11-11] to use the more compact notation $M = (A, B, p[1])$.


Example 11.5-2 
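(describing an HMM) Describe the two models in Example 11.5-1 using the parameters of the six-tuple $M$.


Solution For model one we obtain $N = 2$ (two states), $M = 2$ (two characters H and T), and

$$V = (H,T),\quad A = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix},\quad B = \begin{bmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \end{bmatrix},\quad p[1] = [0.7\ \ 0.3].$$

For model two we obtain $N = 3$, $M = 2$, and

$$V = (H,T),\quad A = \begin{bmatrix} 0.3 & 0.3 & 0.4 \\ 0.1 & 0.6 & 0.3 \\ 0.7 & 0.1 & 0.2 \end{bmatrix},\quad B = \begin{bmatrix} 0.5 & 0.5 \\ 0.8 & 0.2 \\ 0.4 & 0.6 \end{bmatrix},\quad p[1] = [0.8\ \ 0.1\ \ 0.1].$$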
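The two parameter sets of Example 11.5-2 translate directly into a small data structure. The sketch below is illustrative only: the dictionary layout, the key names, and the helper `is_valid_hmm` are assumptions, while the numbers are those of the example.

```python
MODEL_ONE = {
    "V": ("H", "T"),
    "A": [[0.6, 0.4], [0.3, 0.7]],
    "B": [[0.3, 0.7], [0.6, 0.4]],
    "p1": [0.7, 0.3],
}
MODEL_TWO = {
    "V": ("H", "T"),
    "A": [[0.3, 0.3, 0.4], [0.1, 0.6, 0.3], [0.7, 0.1, 0.2]],
    "B": [[0.5, 0.5], [0.8, 0.2], [0.4, 0.6]],
    "p1": [0.8, 0.1, 0.1],
}

def is_valid_hmm(model, tol=1e-12):
    """Check dimensions and that every probability row is nonnegative and sums to one."""
    n_states = len(model["p1"])       # N
    n_symbols = len(model["V"])       # alphabet size M
    shapes_ok = (
        len(model["A"]) == n_states
        and all(len(row) == n_states for row in model["A"])
        and len(model["B"]) == n_states
        and all(len(row) == n_symbols for row in model["B"])
    )
    rows = model["A"] + model["B"] + [model["p1"]]
    sums_ok = all(abs(sum(row) - 1.0) <= tol for row in rows)
    nonnegative = all(value >= 0.0 for row in rows for value in row)
    return shapes_ok and sums_ok and nonnegative

both_valid = is_valid_hmm(MODEL_ONE) and is_valid_hmm(MODEL_TWO)
```

The row-sum check mirrors the requirement that the rows of $A$ add to unity; the same requirement applies to the rows of $B$ and to $p[1]$.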


In digital signal processing, an HMM can be designed for simulation by using the prescribed parameters to generate realizations of the random observation sequence $X[1], X[2], \ldots, X[n], \ldots, X[L]$. Here $L$ is an integer that denotes the maximum number of observations. A procedure to realize this design might proceed as follows:


(1) Choose a realization of the initial state $S[1]$ according to the initial-state distribution vector. For example, if $N = 3$ and $p_1[1] = 0.6$, $p_2[1] = 0.3$, and $p_3[1] = 0.1$, use a uniform random number (RN) generator to determine the initial state. If $0 \le RN < 0.6$, $S[1] = q_1$; if $0.6 \le RN < 0.9$, $S[1] = q_2$; and if $0.9 \le RN < 1.0$, $S[1] = q_3$.

(2) Set $n = 1$.

(3) Obtain a realization of $X[n] = v_{p(n)} \in V$ according to the state conditional output probabilities in $B$. Again a random number generator can be used for this purpose.

(4) Use a random number generator (or other pseudo-stochastic or stochastic method) to transfer to the next state in accordance with state transition probabilities in $A$.

(5) For $n < L$ set $n = n + 1$ and go to step (3). Otherwise terminate the procedure.


Thus an HMM can be used as a generator of observations. It can also be used in reverse 
fashion, that is, given a sequence of observations, determine which of several competing 
models was most likely responsible for the observations. 
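The five-step generation procedure above can be sketched as follows. This is one possible realization rather than a prescribed implementation: the function names `draw` and `generate`, the seed, and the choice of model one as the example are assumptions, and Python's `random.Random` plays the role of the uniform random number generator.

```python
import random

def draw(probabilities, rng):
    """Pick an index by comparing a uniform random number with the running sum."""
    u = rng.random()                   # uniform on [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if u < cumulative:
            return index
    return len(probabilities) - 1      # guard against floating-point round-off

def generate(A, B, p1, V, L, seed=0):
    """Return (states, observations) of length L; states are numbered from 1."""
    rng = random.Random(seed)
    states, observations = [], []
    state = draw(p1, rng)                             # step (1)
    for n in range(1, L + 1):                         # steps (2) and (5)
        states.append(state + 1)
        observations.append(V[draw(B[state], rng)])   # step (3)
        if n < L:
            state = draw(A[state], rng)               # step (4)
    return states, observations

A = [[0.6, 0.4], [0.3, 0.7]]   # model one of Example 11.5-2
B = [[0.3, 0.7], [0.6, 0.4]]
p1 = [0.7, 0.3]
states, observations = generate(A, B, p1, ("H", "T"), L=10, seed=1)
```

An observer of the HMM would see only `observations`; the list `states` is returned here purely so that the hidden sequence can be inspected during simulation.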


Application to Speech Processing 


As a review of the literature will show, HMM’s are extensively used in speech recognition. 
Here, we only briefly review the most basic aspects of isolated word recognition. Indeed, a 
description of the procedure to merely obtain suitable observation vectors could easily take 
more than a chapter of a book. We mention only that the observation vectors are obtained 
from the spectral content of the speech sample through a process called linear predictive 
coding (LPC) which is extensively discussed in the literature on speech processing [11-16]. 
In speech recognition each of, say, $K$ words is modeled by an HMM and the totality of all $K$ words is then modeled by $K$ HMM's that we denote as $M_1, M_2, \ldots, M_K$, where $M_i = (A_i, B_i, p_i[1])$. The $M_i$, $i = 1,\ldots,K$ are the so-called word models. The problem of
designing an appropriate word model for a given word is often considered the most diffi- 
cult of the various tasks associated with speech recognition. It requires extensive training 
involving many human talkers if it is to be speaker independent. Fortunately, it need be 
done only once for each word and is done off-line, that is, before the model is used as an 
automatic word recognizer. 


Example 11.5-3 
(the left-to-right model) An example of an HMM that is useful in isolated word recognition 
is the left-to-right HMM shown in Figure 11.5-2. 

The left-to-right HMM has the property that the A matrix is upper-triangular and that 
there are single initial and final states. Once the process enters a new state it can never 
return to an earlier state. The initial state distribution vector is p[1] = (1,0,0,...,0). When 
in the current state, the model can repeat the current state or advance one or two states. 


Figure 11.5-2 A five-state left-to-right HMM used in word recognition.

The basic problem in spoken word recognition is to determine which from among the $K$ word models $M_1, M_2, \ldots, M_K$ is the source of the event $E = \{X[1] = v_{p(1)}, X[2] = v_{p(2)}, \ldots, X[L] = v_{p(L)}\}$. In keeping with the literature, we call $E$ the complete observation sequence. We remind the reader that $v_{p(i)} \in V$ for all $i = 1,\ldots,L$ and $p(i) \in \{1,2,\ldots,M\}$. A maximum a posteriori (MAP) estimator, $M^*$, would require solving $M^* = \arg\max_i P[M_i|E]$. This, in turn, would require a priori knowledge of the distribution of the words, which is not generally known. If, on the other hand, we assume that all the $M_i$ are equally likely, then the MAP estimator is equivalent to the maximum-likelihood estimator (MLE) $M^* = \arg\max_i P[E|M_i]$. The efficient computation of $P[E|M_i]$ for each word is an important consideration. A brute force approach, that is, one following directly from the definition, is not feasible.


Example 11.5-4 
(number of operations needed to select the Markov model) Assume a massively interconnected HMM with $N$ states and $L$ observations. Show that the number of operations required to compute $P[E|M]$ from the definition is of the order of $N^L$.


Solution Computing $P[E|M]$ directly from the definition requires summing the probabilities over all possible state sequences. A particular state sequence is sometimes called a path. If we use the symbol $Q_i$ to denote the event that the observation vector is associated with the $i$th path, then $P[E|M] = \sum_i P[E, Q_i|M] = \sum_i P[E|Q_i, M]\,P[Q_i|M]$.

For each of the observations, starting at $n = 1$ and ending at $n = L$, there is a choice of $N$ states and, therefore, for all $L$ observations there are $\sim N^L$ possible state sequences. For $N = 5$ and $L = 100$, there are of the order of $5^{100}$ sums of products. A more precise calculation [11-14] shows that there are $2LN^L$ operations required. For the numbers given above, these amount to $\sim 10^{72}$ operations. We consider next a more efficient approach to this problem by capitalizing on the structure of the computations.
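For a very short observation sequence the brute-force sum over all paths is still feasible, which makes the $N^L$ growth easy to see. The sketch below applies it to model one of Example 11.5-2 and the sequence HTT; the function name and the symbol-to-column mapping `SYMBOL` are assumptions introduced for illustration.

```python
from itertools import product

A = [[0.6, 0.4], [0.3, 0.7]]   # model one of Example 11.5-2
B = [[0.3, 0.7], [0.6, 0.4]]
p1 = [0.7, 0.3]
SYMBOL = {"H": 0, "T": 1}      # column of B for each output symbol

def brute_force_likelihood(A, B, p1, observations):
    """Sum P[E, Q_i | M] over every state sequence Q_i; returns (P[E|M], path count)."""
    n_states = len(p1)
    obs = [SYMBOL[o] for o in observations]
    total, n_paths = 0.0, 0
    for path in product(range(n_states), repeat=len(obs)):
        prob = p1[path[0]] * B[path[0]][obs[0]]
        for n in range(1, len(obs)):
            prob *= A[path[n - 1]][path[n]] * B[path[n]][obs[n]]
        total += prob
        n_paths += 1
    return total, n_paths

likelihood, n_paths = brute_force_likelihood(A, B, p1, "HTT")
```

Here $N = 2$ and $L = 3$, so only $2^3 = 8$ paths are enumerated. The same loop with $N = 5$ and $L = 100$ would have to visit $5^{100}$ paths, which is why this approach is not used in practice.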


Efficient Computation of P[E|M] with a Recursive Algorithm


As Example 11.5-4 demonstrated, computing P[E|M] without considering the underlying 
lattice structure of the state transition sequence is a hopeless task. A recursive algorithm 
that does take advantage of this structure is the so-called forward-backward procedure. The 
forward procedure refers to an iteration that begins at n = 1 and proceeds to the present, 
while the backward procedure refers to a very similar algorithm that proceeds to the present 
from n = L. We shall focus primarily on the forward procedure. Either approach yields the 
same results. The forward procedure does not require the availability of all the observations 
before it can proceed. We shall first describe the algorithms and then explain why they are 
so much more efficient than the direct approach. 

Define the event $E_n \triangleq \{X[1] = v_{p(1)}, \ldots, X[n] = v_{p(n)}\}$ and the forward variable $\alpha_n[i] = P[E_n, S[n] = q_i\,|\,M]$. Here $i$ refers to the state $q_i$ and $n$, whether as an argument or subscript, refers to time. In words, $\alpha_n[i]$ is the probability of the joint event $E_n$ and that the present state is $q_i$. We require the initialization $\alpha_1[i] \triangleq p_i[1] \times P[X[1] = v_{p(1)}\,|\,S[1] = q_i, M]$.

Note that $\alpha_{n+1}[j] = P[E_{n+1}, S[n+1] = q_j\,|\,M]$. Had we known that the previous state had been $q_i$, then we could have written that $\alpha_{n+1}[j] = \alpha_n[i]\,a_{ij}\,b_{j,p(n+1)}$. (Recall that $a_{ij}$ is the transition probability of going to state $q_j$ from state $q_i$ and $b_{j,p(n+1)}$ is the probability of observing the symbol $v_{p(n+1)}$ while in state $q_j$.) However, since the transition to state $q_j$ could have come from any state for which $a_{ij} \neq 0$, the correct recursion is

$$\alpha_{n+1}[j] = \left[\sum_{i=1}^{N} \alpha_n[i]\,a_{ij}\right] b_{j,p(n+1)} \tag{11.5-1}$$

for $j = 1,\ldots,N$ and $n = 1,\ldots,L-1$. When $n = L-1$, we obtain $\alpha_L[j] = P[E, S[L] = q_j\,|\,M]$, where we implicitly let $E_L = E$. Now recall from basic probability theory that if a sequence of events $\{A_j\}$, $j = 1,\ldots,N$ has the property that $A_iA_j = \phi$ (the empty set) for $i \neq j$ and $\cup_j A_j = \Omega$ (the certain event), then for any event $B$, $P[B] = \sum_{j=1}^{N} P[BA_j]$. Associating the $\{A_j\}$ with $\{S[L] = q_j\}$ we obtain the important result that

$$P[E|M] = \sum_{j=1}^{N} \alpha_L[j]. \tag{11.5-2}$$


An estimate of the number of computations for the forward algorithm is easily obtained as follows: From Equation 11.5-1 we see that for a fixed $j$ and $n$, the computation of $\alpha_{n+1}[j]$ requires $\sim N$ operations. Repeating the process for $j = 1,2,\ldots,N$ requires a total of $\sim N^2$ operations. Finally, repeating the calculation for $n = 1,2,\ldots,L-1$ yields a total of $\sim N^2L$ multiplication/addition type of operations, to be contrasted with $2LN^L$ using the definitions (or $\sim 2500$ versus $\sim 200 \times 5^{100}$). To what is owed this saving of roughly 69 orders of magnitude?
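The forward recursion of Equations 11.5-1 and 11.5-2 takes only a few lines. The following sketch applies it to both models of Example 11.5-2 for the observation sequence HTT; the function name `forward_likelihood` and the encoding of H and T as column indices 0 and 1 are assumptions made for illustration.

```python
def forward_likelihood(A, B, p1, obs):
    """Return P[E|M] using the forward variable alpha_n[j]."""
    n_states = len(p1)
    # Initialization: alpha_1[i] = p_i[1] * b_{i,p(1)}
    alpha = [p1[i] * B[i][obs[0]] for i in range(n_states)]
    # Recursion: alpha_{n+1}[j] = (sum_i alpha_n[i] * a_ij) * b_{j,p(n+1)}
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: P[E|M] = sum_j alpha_L[j]
    return sum(alpha)

HTT = [0, 1, 1]   # H -> column 0 of B, T -> column 1

A1 = [[0.6, 0.4], [0.3, 0.7]]
B1 = [[0.3, 0.7], [0.6, 0.4]]
p1_one = [0.7, 0.3]

A2 = [[0.3, 0.3, 0.4], [0.1, 0.6, 0.3], [0.7, 0.1, 0.2]]
B2 = [[0.5, 0.5], [0.8, 0.2], [0.4, 0.6]]
p1_two = [0.8, 0.1, 0.1]

likelihood_one = forward_likelihood(A1, B1, p1_one, HTT)
likelihood_two = forward_likelihood(A2, B2, p1_two, HTT)
best_model = 1 if likelihood_one >= likelihood_two else 2
```

With these parameters the sketch gives approximately 0.1142 for model one and 0.1036 for model two, so under the equal-prior assumption the maximum-likelihood choice for HTT would be model one. Each time step costs on the order of $N^2$ multiplications, in line with the $\sim N^2L$ count above.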

As an analogy consider a surveyor who is assigned to measure the road distances from 
a distant city, say, A to three cities B, C, and D that are quite near each other. The road 
map is shown in Figure 11.5-3. 

An efficient way to measure the distances $d_{AB}$, $d_{AC}$, $d_{AD}$ is as follows: Measure $d_{AB}$. Then measure $d_{BC}$ and $d_{CD}$ and compute $d_{AC} = d_{AB} + d_{BC}$, and $d_{AD} = d_{AC} + d_{CD}$. Alternatively, the surveyor could measure $d_{AB}$, return to $A$, and measure the total distance $d_{AC}$ without considering that to get to $C$ he must pass through $B$, return to $A$ once again, and measure the total distance $d_{AD}$. In the former method the surveyor builds on earlier
measurements and knowledge of the road map to get subsequent distances. In the second 
case, the surveyor makes each measurement without building up from earlier results. It 
is essentially the same with computing P[E|M] from the definition. To do so ignores the 
lattice structure of the computations and ignores the fact that different state sequences 
share the same subsequences. In the recursive approach, the computations build on each 
other. As n increases, the updated probability computation builds on the previous ones. 


Figure 11.5-3 A road map from city A to cities B, C, and D.


Figure 11.5-4 Lattice structure of the implementation of the computation of $\alpha_n[j]$. (Three states plotted against discrete time, $n = 1, 2, 3, 4$.)


The lattice structure of the algorithm is shown in Figure 11.5-4 for three states and for four 
time increments. It should be clear from the diagram that as time increases, different state 
sequences remerge into the same three states, sharing many of the previous subsequences. 


One can also define an event $E'_n \triangleq \{X[n+1] = v_{p(n+1)}, \ldots, X[L] = v_{p(L)}\}$ so that $E_n \cap E'_n = E$. In terms of this event one can define the so-called backward variable $\beta_n[i] \triangleq P[E'_n\,|\,S[n] = q_i, M]$ with the arbitrary initialization $\beta_L[i] = 1$, $i = 1,\ldots,N$, to obtain the recursion

$$\beta_n[i] = \sum_{j=1}^{N} \beta_{n+1}[j]\,a_{ij}\,b_{j,p(n+1)}. \tag{11.5-3}$$

The recursive algorithm to compute $P[E|M]$ using Equation 11.5-3 has the same computational complexity as the one using the forward variable $\alpha_n[j]$. We leave the details as an exercise for the reader.
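A sketch of the backward computation follows, again for model one and the sequence HTT. It relies on the standard termination step $P[E|M] = \sum_{i} p_i[1]\,b_{i,p(1)}\,\beta_1[i]$, which is not written out in the text above and is stated here as an assumption of the sketch; the function name is likewise hypothetical.

```python
def backward_likelihood(A, B, p1, obs):
    """Return P[E|M] using the backward variable beta_n[i]."""
    n_states = len(p1)
    beta = [1.0] * n_states                 # initialization: beta_L[i] = 1
    # Work backward through the symbols observed at times L, L-1, ..., 2.
    for symbol in reversed(obs[1:]):
        beta = [
            sum(A[i][j] * B[j][symbol] * beta[j] for j in range(n_states))
            for i in range(n_states)
        ]
    # Combine beta_1[i] with the initial state and first-symbol probabilities.
    return sum(p1[i] * B[i][obs[0]] * beta[i] for i in range(n_states))

A = [[0.6, 0.4], [0.3, 0.7]]   # model one of Example 11.5-2
B = [[0.3, 0.7], [0.6, 0.4]]
p1 = [0.7, 0.3]
likelihood = backward_likelihood(A, B, p1, [0, 1, 1])   # H, T, T
```

The result should match the forward computation for the same model and observations (about 0.1142 here), which is a convenient cross-check when implementing both recursions.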


Viterbi Algorithm and the Most Likely State Sequence 
for the Observations 


In the algorithm based on the forward variable (a similar statement applies to the backward 
variable recursion), we computed P[E|M] by averaging over all state transitions. An alter- 
native procedure, which is often more efficient, is to find, for each model, the path that was 
most likely to have produced the observations. Sometimes this path is called the minimum 
cost path in the event that a criterion other than most probable is used. The algorithm for 
finding such a path is based on the principle of dynamic programming and the algorithm 
itself is called the Viterbi algorithm [11-17]. The principle of dynamic programming can be 
illustrated as follows. Suppose we wish to find the shortest path from a point $A$ to $B$ in a connected graph with many links and nodes and many possible paths from point $A$ to $B$. Consider a node $\eta_i$; then among all paths going from $A$ to $B$ via $\eta_i$, only the shortest path $l^*$ from $\eta_i$ to $B$ needs to be stored. All other paths from $\eta_i$ to $B$ can be discarded. This follows from there being only two possibilities: Either the shortest path from $A$ to $B$ takes us through $\eta_i$ or not. If it takes us through $\eta_i$, then any path from $\eta_i$ to $B$ different from $l^*$ will increase the overall path length. If the shortest path does not go through $\eta_i$, it must go through some other node, say, $\eta_j$. Then we repeat the process of finding the shortest path from $\eta_j$ to $B$. In this way we can backtrack to $A$ from $B$, always storing the subsequence of nodes that yield the shortest paths.

Consider now the application of this reasoning to finding the sequence of states that was most likely to have produced the given partial observation vector, that is, the event $E_n$. Equation 11.5-1 is the recursion that averages over all paths. If we replace this recursion with the most likely path algorithm, we need to replace the summation by a maximization over all path sequences leading to the present state. To find the single most likely path $Q^* = (q_1^*, q_2^*, \ldots, q_L^*)$ we proceed as follows. Define the event $S_n[i] = \{S[1] = q_{p(1)}, S[2] = q_{p(2)}, \ldots, S[n] = q_i\}$. From elementary probability theory we can write

$$P[S_n[i]\,|\,E_n, M] = \frac{P[S_n[i], E_n\,|\,M]}{P[E_n\,|\,M]}.$$

Since the denominator does not depend on $S_n[i]$, maximizing the left-hand side is equivalent to maximizing the numerator on the right-hand side over the state sequences leading up to state $q_i$. For the Viterbi algorithm we define the variable

$$\varphi_n[i] = \max_{q_{p(1)}, q_{p(2)}, \ldots, q_{p(n-1)}} P[S_n[i], E_n\,|\,M] \tag{11.5-4}$$

and observe that

$$\varphi_{n+1}[j] = \max_i\big\{\varphi_n[i]\,a_{ij}\big\}\,b_{j,p(n+1)}. \tag{11.5-5}$$


The interpretation of Equations 11.5-4 and 11.5-5 is as follows. Equation 11.5-4 finds the most probable path to state $q_i$, at time $n$, that accounts for the observation vector $E_n$. Now to find the most probable path at time $n+1$ to state $q_j$, we need only consider the most probable paths to states $q_1, q_2, \ldots, q_N$ since the overall most probable path must transit through one of these states at time $n$. This is a forward application of the dynamic programming principle if we replace "shortest path" in the earlier discussion on dynamic programming by "most likely path." Figure 11.5-5 shows how the recursion works.


Figure 11.5-5 (State versus discrete time, showing the transitions from time $n$ to time $n+1$.) The heavy black lines leading to the state nodes at time $n$ represent the overall most probable path to these states. For simplicity's sake, think of these computations as having been done according to Equation 11.5-4. The dotted lines indicate tested paths that have been rejected. The most likely paths to the various states at $n+1$ are then computed as in Equation 11.5-5; these are shown as heavy black lines leading to the state nodes at $n+1$. At each instant of time, we must store the $N$ values of $\psi_n[j]$, $j = 1,\ldots,N$. For $L$ observations we must store, therefore, $NL$ such values in addition to the observations themselves.

Below, we furnish the entire Viterbi procedure for finding the most likely state sequence consistent with the observations:


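1. Initialization

$$\varphi_1[i] = p_i[1]\, b_{i,p(1)}, \quad i = 1, \ldots, N;$$

$$\psi_1[i] = 0, \quad i = 1, \ldots, N.$$

The $\psi_n[i]$ are the path tracking functions.

2. Recursion

$$\varphi_{n+1}[j] = \max_i \left[\varphi_n[i]\, a_{ij}\, b_{j,p(n+1)}\right], \quad n = 1, \ldots, L-1; \; j = 1, \ldots, N;$$

$$\psi_{n+1}[j] = \arg\max_i \left[\varphi_n[i]\, a_{ij}\right], \quad n = 1, \ldots, L-1; \; j = 1, \ldots, N.$$

3. Termination

$$P^* = P[E, S^*[L] \mid M] = \max_i \varphi_L[i];$$

$$q_L^* = \arg\max_i \varphi_L[i].$$

4. State sequence backtracking

$$q_n^* = \psi_{n+1}(q_{n+1}^*), \quad n = L-1, L-2, \ldots, 1.$$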


For any but small values of N and L, these recursions require the aid of a computer. We illustrate
with a hand-computable example.


Example 11.5-5 
(numerical illustration of the Viterbi algorithm) In Example 11.5-1, the sequence HTT is 
observed. For model one, compute the most likely path and its probability using the Viterbi 
algorithm. 


Solution For model one, the parameters are, from Example 11.5-2, 


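$$A = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix}, \quad p[1] = (0.7 \;\; 0.3), \quad V = (H, T), \quad \text{and} \quad B = \begin{bmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \end{bmatrix}.$$
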
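Step 1: Initialization (n = 1)

$$\varphi_1[1] = p_1[1]\, P[H \mid q_1] = 0.7 \times 0.3 = 0.21;$$

$$\varphi_1[2] = p_2[1]\, P[H \mid q_2] = 0.3 \times 0.6 = 0.18;$$

$$\psi_1[1] = \psi_1[2] = 0.$$

Step 2: Recursion

n = 2:

$$\varphi_2[1] = \max\{\varphi_1[1]\, a_{11} \times P[T \mid q_1],\; \varphi_1[2]\, a_{21} \times P[T \mid q_1]\} \quad (\text{state } q_1)$$
$$= \max\{0.0882,\; 0.0378\} = 0.0882;$$

$$\psi_2[1] = 1;$$

$$\varphi_2[2] = \max\{\varphi_1[1]\, a_{12} \times P[T \mid q_2],\; \varphi_1[2]\, a_{22} \times P[T \mid q_2]\} \quad (\text{state } q_2)$$
$$= \max\{0.0336,\; 0.0504\} = 0.0504;$$

$$\psi_2[2] = 2.$$

n = 3:

$$\varphi_3[1] = \max\{\varphi_2[1]\, a_{11} \times P[T \mid q_1],\; \varphi_2[2]\, a_{21} \times P[T \mid q_1]\} \quad (\text{state } q_1)$$
$$= \max\{0.0370,\; 0.0106\} = 0.0370;$$

$$\psi_3[1] = 1;$$

$$\varphi_3[2] = \max\{\varphi_2[1]\, a_{12} \times P[T \mid q_2],\; \varphi_2[2]\, a_{22} \times P[T \mid q_2]\} \quad (\text{state } q_2)$$
$$= \max\{0.0141,\; 0.0141\} = 0.0141;$$

$$\psi_3[2] = 1 \text{ or } 2.$$

Step 3: Termination

$$P^* = \max\{\varphi_3[1], \varphi_3[2]\} = \max\{0.0370,\; 0.0141\} = 0.0370;$$

$$q_3^* = \arg\max\{\varphi_3[1], \varphi_3[2]\} = 1.$$

Step 4: State sequence backtracking

$$q_n^* = \psi_{n+1}(q_{n+1}^*), \quad n = L-1, L-2, \ldots, 1.$$

Hence

$$q_2^* = \psi_3(1) = 1;$$
$$q_1^* = \psi_2(1) = 1.$$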


So the most likely state sequence for observing HTT is 1, 1, 1. 
Were we to repeat the calculation for the observation HHH, we would have found the 
most likely sequence to be 2, 2, 2. We leave the verification of this result as an exercise for
the reader.
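The four steps of the Viterbi procedure can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `viterbi`, the 0-based state indexing, and the encoding of H and T as symbol indices 0 and 1 are assumptions made here. The model parameters are those of Example 11.5-5.

```python
def viterbi(p1, A, B, obs):
    """Most likely state path for a hidden Markov model.

    p1[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability of observing symbol k in state j
    obs     : list of observed symbol indices
    Returns (path, prob) with 0-based state indices.
    """
    N, L = len(p1), len(obs)
    # Step 1: initialization
    phi = [[p1[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # Step 2: recursion
    for n in range(1, L):
        phi_n, psi_n = [], []
        for j in range(N):
            scores = [phi[n - 1][i] * A[i][j] for i in range(N)]
            best_i = max(range(N), key=scores.__getitem__)
            psi_n.append(best_i)                       # path tracking function
            phi_n.append(scores[best_i] * B[j][obs[n]])
        phi.append(phi_n)
        psi.append(psi_n)
    # Step 3: termination
    last = max(range(N), key=phi[-1].__getitem__)
    prob = phi[-1][last]
    # Step 4: state sequence backtracking
    path = [last]
    for n in range(L - 1, 0, -1):
        path.append(psi[n][path[-1]])
    path.reverse()
    return path, prob


# Model one of Example 11.5-5 (H -> 0, T -> 1 is an encoding chosen here)
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.3, 0.7], [0.6, 0.4]]
p1 = [0.7, 0.3]
H, T = 0, 1

path, prob = viterbi(p1, A, B, [H, T, T])
print([s + 1 for s in path], prob)   # states reported 1-based, as in the text
```

Calling `viterbi(p1, A, B, [H, H, H])` in the same way gives the path for the observation HHH mentioned above. When two predecessor states tie, as for state 2 at n = 3 in the example, this sketch simply keeps the first one found.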


11.6 SPECTRAL ESTIMATION


In Chapter 5, we estimated the means and correlations (or covariances) of random vectors 
and later random sequences in Section 8.8. In the special case of a WSS random sequence, 
we employed the ergodic hypothesis (cf. Section 10.4) to provide an asymptotically exact 
estimate of the mean and correlation function as the number of samples tends to infinity. 
In this section, we show how to make correspondingly good estimates of the psd’s of
random sequences.

There are two classical approaches to power spectral density estimation. One
can first estimate the correlation function and then calculate the psd estimate as the
Fourier transform of the correlation function estimate, most notably with the discrete
Fourier transform. Alternatively, one can estimate the psd directly from the Fourier transform of the
data. The conceptually simplest approach here is to take the magnitude squared of the
Fourier transform of the data record, also known as the periodogram. While this simple
periodogram estimate is often used in practice, we will see below that its large variance
makes it quite a noisy psd estimator.


The Periodogram 


We start with a WSS and zero-mean random sequence X[n] observed over $-\infty < n < +\infty$.
We must additionally assume the sequence is ergodic in correlation. We use the window
function $w_N[n] \triangleq 1$ for $0 \le n \le N-1$ and 0 elsewhere, to cut out a finite-length section:

$$X_N[n] \triangleq w_N[n]\, X[n].$$

We then estimate the correlation function $R_{XX}[m]$ by the time average

$$\hat{R}_N[m] \triangleq \frac{1}{N}\, X_N[m] * X_N^*[-m]. \tag{11.6-1}$$

Since the random sequence is ergodic in correlation (cf. Definition 10.4-3 for the random
process version), say in the mean-square (m.s.) sense, then this estimate should converge to
the ensemble average correlation function $R_{XX}[m]$ in the m.s. sense, that is,

$$\lim_{N \to \infty} \hat{R}_N[m] = R_{XX}[m] \quad \text{(m.s.)} \tag{11.6-2}$$

based on discrete-time versions of the ergodic arguments used in Section 10.4. In particular 
see Theorem 10.4-2 and Problem 11.12, which works out the details in the case of a Gaussian 
random sequence. 

Since the psd $S_{XX}(\omega)$ is the Fourier transform of the correlation function $R_{XX}[m]$, one
might hope that the Fourier transform of $\hat{R}_N[m]$ would be a good estimate of $S_{XX}(\omega)$. We
investigate this possibility by evaluating the mean and variance of this psd estimate. We
hope that the mean of this psd estimate will be correct, in which case the estimate is called
unbiased, and that the variance of the psd estimator will decrease toward zero as $N \to \infty$,
in which case the estimate is called consistent.

Denote the Fourier transform of $\hat{R}_N[m]$ as $I_N(\omega)$; then we have

$$I_N(\omega) \triangleq FT\{\hat{R}_N[m]\}$$

$$= \sum_{m=-N}^{+N} \hat{R}_N[m] \exp(-j\omega m)$$

$$= \sum_{m} \frac{1}{N} \sum_{n} X_N[n+m]\, X_N^*[n] \exp(-j\omega m)$$

$$= \frac{1}{N} |X_N(\omega)|^2, \tag{11.6-3}$$

where

$$X_N(\omega) \triangleq FT\{X_N[n]\}.$$

In words, $X_N(\omega)$ is the Fourier transform of N samples of the WSS random sequence X[n].
This quantity can thus be well approximated by any of the fast DFT routines [11-18]
normally available on scientific workstations.

Calculating the mean value of the periodogram $I_N(\omega)$, we obtain

$$E[I_N(\omega)] = E\left[\sum_{m=-(N-1)}^{+(N-1)} \hat{R}_N[m] \exp(-j\omega m)\right]$$

$$= \sum_{m=-(N-1)}^{+(N-1)} E[\hat{R}_N[m]] \exp(-j\omega m)$$

$$= \sum_{m=-(N-1)}^{+(N-1)} \left(\frac{N - |m|}{N}\right) R_{XX}[m] \exp(-j\omega m), \tag{11.6-4}$$

so, as $N \to \infty$, we would expect the mean to be asymptotically correct, that is, $S_{XX}(\omega)$,
if $|R_{XX}[m]|$ tends to zero fast enough as $|m| \to \infty$. It is easy to see that the precise
criterion needed is $\sum |m|\,|R_{XX}[m]| < \infty$. Then the mean of the periodogram $I_N(\omega)$ will
be asymptotically correct, that is, the periodogram estimate will be asymptotically
unbiased:

$$\lim_{N \to \infty} E[I_N(\omega)] = S_{XX}(\omega).$$

We note in passing that the random sequence X[n] must have zero mean to satisfy the
convergence criterion $\sum |m|\,|R_{XX}[m]| < \infty$. If the mean is not zero, then there is an impulse
at $\omega = 0$, which must be handled separately.

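We can also look at Equation 11.6-4 as introducing a correlation window as a multiplier
on $R_{XX}$; then by the well-known correspondence of convolution and multiplication, we
obtain

$$E[I_N(\omega)] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S_{XX}(\nu)\, \frac{1}{N} \left(\frac{\sin[(\omega - \nu)N/2]}{\sin[(\omega - \nu)/2]}\right)^2 d\nu,$$

where the triangular window function,

$$t_N[n] \triangleq \frac{1}{N}\, w_N[n] \otimes w_N[n],^{\dagger}$$

has as its Fourier transform, the square of the periodic sinc function, as shown.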

We now turn to the evaluation of the variance of the periodogram $I_N(\omega)$. Unfortunately,
we find that the variance does not tend to zero as N approaches infinity. In fact the variance
does not even get small. Under certain assumptions to be detailed below, the asymptotic
variance of the periodogram estimator is lower bounded by the square of the power spectral
density! Thus, $I_N(\omega)$ turns out to be quite a bad estimate of the psd.

†The deterministic autocorrelation operator is denoted by the $\otimes$ symbol: $x[n] \otimes y[n] \triangleq \sum_{k=-\infty}^{+\infty} x[k]\, y[n+k]$.

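To calculate the variance we proceed as follows: Employing $\mathrm{Var}[I_N(\omega)] = E[I_N^2(\omega)] - E^2[I_N(\omega)]$, we first calculate

$$E[I_N^2(\omega)] = \frac{1}{N^2} \sum_n \sum_m \sum_p \sum_q w_N[n]\, w_N[m]\, w_N[p]\, w_N[q]$$
$$\times\, E[X[n] X^*[m] X[p] X^*[q]]\, e^{-j\omega(n-m)} e^{-j\omega(p-q)}.$$

Next we invoke the Gaussian assumption and use the fourth-moment property of the Gaussian
distribution (ref. Problem 5.31) to obtain

$$E[I_N^2(\omega)] = E^2[I_N(\omega)] + \left|\frac{1}{N} \sum_n \sum_p w_N[n]\, w_N[p]\, R_{XX}[n-p]\, e^{-j\omega(n+p)}\right|^2 + \left|\frac{1}{N} \sum_n \sum_q w_N[n]\, w_N[q]\, R_{XX}[n-q]\, e^{-j\omega(n-q)}\right|^2.$$

Thus, the variance is just given by the last two terms. Converting them to the frequency
domain, we can then write

$$\mathrm{Var}[I_N(\omega)] = \left|\frac{1}{2\pi} \int_{-\pi}^{+\pi} S_{XX}(\lambda)\, \frac{1}{N}\, W_N(\lambda + \omega)\, W_N^*(\lambda - \omega)\, d\lambda\right|^2 + \left(\frac{1}{2\pi} \int_{-\pi}^{+\pi} S_{XX}(\lambda)\, \frac{1}{N}\, |W_N(\lambda - \omega)|^2\, d\lambda\right)^2, \tag{11.6-5}$$

where $W_N(\omega) \triangleq FT\{w_N[n]\}$. For sufficiently large N, the first term will be near zero
when $\omega$ is not near zero, while the second term will tend to $E^2[I_N(\omega)] \to S_{XX}^2(\omega)$. We
thus conclude that the periodogram is not a very accurate estimator of the psd, with its
asymptotic variance satisfying

$$\mathrm{Var}[I_N(\omega)] \gtrsim S_{XX}^2(\omega). \tag{11.6-6}$$

While the mean of the periodogram is asymptotically correct, we see that the standard
deviation of this spectral estimate is as big as the psd!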


Bartlett's Procedure: Averaging Periodograms‡

Since we assume that the random sequence X[n] is available for time [0, N − 1], we can form
several or many periodogram estimates from different segments of the random sequence,
that is, let M divide N and set $K \triangleq N/M$. Then form K segments of length M as follows:

$$X^i[n] \triangleq X[n + iM - M], \quad \text{where } 0 \le n \le M-1 \text{ and } 1 \le i \le K.$$

‡See reference [11-19].


We can now form K periodograms whose statistical errors will be largely uncorrelated; then
by performing simple averaging on these periodograms we should obtain a usable estimate
of the power spectral density $S_{XX}(\omega)$. We define the K periodograms as follows:

$$I_M^{(i)}(\omega) \triangleq \frac{1}{M} \left|\sum_{n=0}^{M-1} X^i[n] \exp(-j\omega n)\right|^2, \quad 1 \le i \le K.$$

Then define the averaged periodogram estimate of the psd as

$$B_X(\omega) \triangleq \frac{1}{K} \sum_{i=1}^{K} I_M^{(i)}(\omega).$$


Clearly this averaging will not disturb the asymptotic unbiasedness of the individual peri- 
odograms, although much more data may now be required to ameliorate the windowing 
effects due to the smaller window length M used here. However, a real advantage will 
occur in the variance reduction of the Bartlett estimate due to its averaging of individual 
periodogram estimates of nonoverlapping data segments. 

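Calculating the mean of $B_X(\omega)$, we obtain

$$E[B_X(\omega)] = E[I_M^{(1)}(\omega)]$$

$$= \frac{1}{2\pi} \int_{-\pi}^{+\pi} S_{XX}(\nu)\, \frac{1}{M} \left(\frac{\sin[(\omega - \nu)M/2]}{\sin[(\omega - \nu)/2]}\right)^2 d\nu$$

$$\to S_{XX}(\omega) \quad \text{as } M \to \infty.$$

Using Equation 11.6-6, the approximate variance of the Bartlett estimator of the psd can
be found as

$$\mathrm{Var}[B_X(\omega)] \simeq \frac{1}{K}\, \mathrm{Var}[I_M^{(1)}(\omega)] \simeq \frac{1}{K}\, S_{XX}^2(\omega), \tag{11.6-7}$$

as long as the segment length M = N/K is still large enough for Equation 11.6-6 to hold
with M in place of N.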


Example 11.6-1 
(numerical illustration of Bartlett’s procedure) This example uses MATLAB to illustrate the
periodogram and Bartlett estimator for a 1024-element sample of simulated white Gaussian
noise W[n]. The plots in this example are generated by the m-file titled Bartlett.m that
is located at this book’s Web site.

The function randn is first called to generate a 1024-element vector with the Gaussian or
Normal pdf. The mean is zero and the variance is one, that is, $\sigma_W^2 = 1$. The periodogram is
calculated via Equation 11.6-3 using the MATLAB fft function with N = 1024. The resulting
periodogram is shown in Figure 11.6-1. The horizontal axis is numbered corresponding to
uniform samples of $\omega$ going from 0 to $2\pi$; thus the highest discrete-time frequency $\omega = \pi$ is
in the middle of the plot. We note that the variance of this periodogram seems quite large,
with spikes going up almost to seven times the true value of the psd $S_{WW}(\omega) = \sigma_W^2 = 1$.

Figure 11.6-1 Periodogram with N = 1024.

Next the 1024-point data vector is broken down successively, first into four 256-point 
data vectors and then sixteen 64-point data vectors. The Bartlett procedure is then carried 
out for K = 4 and K = 16, respectively, by Bartlett.m. The smoother K = 16 result is 
shown in Figure 11.6-2. In these cases the value of M decreases to M = 256 and M = 64, 
respectively. This is also the number of samples of w provided by the numerical routine 
fft in these cases, again with the Nyquist frequency sample in the middle of the range. As 
is common with MATLAB plots, the vertical range has automatically re-sized to match the 
range of the data. However, with reference to the vertical axis we can see that the Bartlett 
estimate has the expected reduction in variance (cf. Equation 11.6-7). The reader should 
note that we are trading frequency resolution in some sense for this improved statistical 
behavior. If the true psd were not white, that is, not flat, then this disadvantage of lost 
resolution would show up. 
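The variance behavior described in Equations 11.6-6 and 11.6-7 can be checked with a small Monte Carlo experiment. The sketch below is not the book's Bartlett.m; it is a pure-Python illustration in which the function names, the record length N = 256, the test frequency, the number of trials, and the fixed random seed are all assumptions chosen to keep the run short.

```python
import cmath
import random


def periodogram(x, omega):
    """I_N(omega) = |sum_n x[n] exp(-j omega n)|^2 / N  (Equation 11.6-3)."""
    N = len(x)
    X = sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))
    return abs(X) ** 2 / N


def bartlett(x, K, omega):
    """Average of K periodograms over nonoverlapping segments of length M = N/K."""
    M = len(x) // K
    segments = [x[i * M:(i + 1) * M] for i in range(K)]
    return sum(periodogram(s, omega) for s in segments) / K


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)            # fixed seed so that the run is repeatable
N, trials, omega = 256, 400, 1.0  # illustrative values only
per, bar = [], []
for _ in range(trials):
    w = [rng.gauss(0.0, 1.0) for _ in range(N)]   # white Gaussian noise, true psd = 1
    per.append(periodogram(w, omega))
    bar.append(bartlett(w, 16, omega))

print("periodogram    mean %.3f  variance %.3f" % (sum(per) / trials, sample_variance(per)))
print("Bartlett K=16  mean %.3f  variance %.3f" % (sum(bar) / trials, sample_variance(bar)))
```

Both estimators should average close to the true psd value of 1, while the sample variance of the periodogram should stay near 1 (the square of the psd) and that of the Bartlett estimate should be roughly K = 16 times smaller.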


Example 11.6-2 
(continuation of Example 11.6-1) Here the same Bartlett procedure is used to estimate a
nonwhite power spectral density (cf. Figure 11.6-5). A 1024-point sequence generated by
MATLAB is used to simulate an autoregressive moving average (ARMA) random sequence.
This sequence is the output of the routine filter.m, whose input was Normal white noise generated
by the routine randn. The routine filter.m was applied with numerator coefficient vector b =
[1.0 −0.8 −0.1] and denominator coefficient vector a = [1.0 −1.2 +0.4]. The corresponding
transfer function, $H(z) = (1 - 0.8z^{-1} - 0.1z^{-2})/(1 - 1.2z^{-1} + 0.4z^{-2})$, then gives the ARMA
power spectral density:

$$S(\omega) = \sigma^2 |H(e^{j\omega})|^2 = \frac{1.65 - 1.44\cos\omega - 0.2\cos 2\omega}{2.60 - 3.36\cos\omega + 0.8\cos 2\omega}.$$

The periodogram and Bartlett (K = 16) spectral estimates are plotted in Figures 11.6-3
and 11.6-4. Note the reduced noise in the Bartlett K = 16 estimate, as well as the reduced
resolution evident in the broadening of the spectral density peak in comparison with the
true psd shown in Figure 11.6-5.

Figure 11.6-2 The Bartlett type estimate for K = 16. Note reduced vertical scale.

Figure 11.6-3 Periodogram estimate for ARMA model.

Figure 11.6-4 Bartlett (K = 16) estimate for ARMA model.

Figure 11.6-5 The true ARMA power spectral density.

This example is programmed in the MATLAB file specest.m located at this book’s Web
site. You can edit this file and run other cases with different ARMA coefficient values. Also
by repeatedly calling this .m file, with the “state = 0” command removed, you can see the
effect on these spectral estimates of different sample sequences of the driving white noise at
the input to the filter.

Parametric Spectral Estimate


The Markov random sequence was studied in Section 8.5. There in Example 8.5-2, we saw
that a first-order linear difference equation, driven by an i.i.d. random sequence, generated
a Markov random sequence. A Markov-p, or pth order Markov sequence, likewise
can be generated by passing an i.i.d. sequence through a pth order difference equation (ref.
Definition 8.5-2). If we are willing to model our observed random sequence by such a model,
then we can estimate its psd by estimating the parameters of the model. For this model,
the best linear predictor will yield the model parameters. In Section 11.1 it is shown how
to determine this predictor.

Consider the Markov-p model

$$X[n] = \sum_{k=1}^{p} a_k X[n-k] + W[n].$$

Take X and W as zero-mean and write the variance Var{W[n]} as $\sigma_W^2$. Let the vector $\mathbf{X}$ in
Theorem 11.1-2 be given as the p-dimensional vector $\mathbf{X} \triangleq (X[n-1], X[n-2], \ldots, X[n-p])^T$.
Then Equation 11.1-9 provides the linear prediction estimate of the scalar random variable
Y = X[n] in terms of the $a_k$. Expressed in vector form as $\mathbf{a} = (a_1, a_2, \ldots, a_p)^T$, these
coefficients are then determined as the solution to the orthogonality equations, given by
Equation 11.1-10, repeated here for convenience,

$$\mathbf{a}^T = \mathbf{k}_{YX}^T \mathbf{K}_{XX}^{-1}. \tag{11.6-8}$$

The cross-covariance vector and covariance matrix then become

$$\mathbf{k}_{YX} = (K_{XX}[1], K_{XX}[2], \ldots, K_{XX}[p])^T,$$

and

$$\mathbf{K}_{XX} = \begin{bmatrix} K_{XX}[0] & K_{XX}[1] & \cdots & K_{XX}[p-1] \\ K_{XX}[-1] & K_{XX}[0] & \ddots & \vdots \\ \vdots & \ddots & \ddots & K_{XX}[1] \\ K_{XX}[-(p-1)] & \cdots & K_{XX}[-1] & K_{XX}[0] \end{bmatrix}.$$

To obtain a simple parametric psd estimate, we can just replace the covariance function
$K_{XX}[m]$ in the above equations with its estimate provided by Equation 11.6-1.† The solution
will then yield parameter estimates $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_p$ so that the psd estimate can be written as:

$$\hat{S}_{XX}(\omega) = \frac{\hat{\sigma}_W^2}{\left|1 - \sum_{k=1}^{p} \hat{a}_k \exp(-j\omega k)\right|^2}. \tag{11.6-9}$$

It is interesting to note that the AR parametric spectral estimate of the psd has the
covariance (correlation) matching property, that is,

$$IFT\{\hat{S}_{XX}(\omega)\} = \hat{R}_{XX}[m], \quad |m| \le p.$$

†Since the mean is zero here, the correlation and covariance functions are equal.

We note that if the $\hat{R}_{XX}[m]$ are sufficiently accurate, then $\hat{S}_{XX}(\omega)$ can be quite close to
the true psd $S_{XX}(\omega)$, if the random sequence is Markov-p. However, if the true random
sequence X[n] is not Markov-p, then the opposite can occur.

Such parametric spectral estimates can provide greater resolution of closely spaced 
spectral components than can the classical methods. This may be an advantage when the 
amount of data is small. On the other hand, one pays the price of greater sensitivity to 
model assumptions. 
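The parametric estimate can be sketched numerically as follows. This is an illustrative outline rather than the book's implementation: the function names are hypothetical, the data are assumed real and zero-mean, a plain Gaussian-elimination solver stands in for whatever routine one would use in practice, and the driving-noise variance is taken as the prediction-error variance $\hat{R}[0] - \sum_k \hat{a}_k \hat{R}[k]$, a standard choice assumed here because the text does not spell it out.

```python
import cmath


def correlation_estimate(x, p):
    """R_hat[m], m = 0..p, as in Equation 11.6-1 (real, zero-mean data assumed)."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) / N for m in range(p + 1)]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (aug[r][n] - s) / aug[r][r]
    return sol


def ar_fit(R, p):
    """Solve the orthogonality equations (11.6-8) for a_1..a_p given R[0..p]."""
    K = [[R[abs(i - j)] for j in range(p)] for i in range(p)]
    a = solve(K, R[1:p + 1])
    sigma2 = R[0] - sum(a[k] * R[k + 1] for k in range(p))  # assumed noise-variance estimate
    return a, sigma2


def ar_psd(a, sigma2, omega):
    """Evaluate the AR psd of Equation 11.6-9 at frequency omega."""
    denom = 1 - sum(ak * cmath.exp(-1j * omega * (k + 1)) for k, ak in enumerate(a))
    return sigma2 / abs(denom) ** 2


# Check with exact correlations of X[n] = 0.5 X[n-1] + W[n], scaled so R[0] = 1
R = [0.5 ** m for m in range(3)]
a, sigma2 = ar_fit(R, 2)
print(a, sigma2, ar_psd(a, sigma2, 0.0))
```

With data in hand, one would pass `correlation_estimate(x, p)` to `ar_fit` instead of the exact correlations used in this check. Fitting order 2 to a first-order model returns a second coefficient of zero, which reflects the correlation matching property.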


Example 11.6-3 
(effect of using the wrong model in estimating the psd) This example uses the same ARMA
random sequence as was used in Example 11.6-2. Here, however, we use an AR spectral
density estimate and investigate the effect of increasing the predictor order for p = 2, 3,
and 4. Since the data is not Markov (i.e., the psd is not AR), we do not expect to get the
precise psd as the data length $N \to \infty$. However, for large N, and here we use N = 512,
we do expect that for a sufficiently large AR predictor order p, we will get an accurate
estimate.

The correlation function estimate (or covariance function since the involved data is
generated with zero-mean) is given as

$$\hat{R}_N[m] = \frac{1}{N} \sum_{n=0}^{N-1} y[n]\, y[n+m].$$

Then the estimated value of parameter vector a is determined via Equation 11.6-8. These
vectors for p = 2 and 4 are then plugged into Equation 11.6-9 to yield the resulting psd
plots of Figures 11.6-6 and 11.6-7, respectively. Each plot also shows the true ARMA psd
plotted as a dashed line for ease of comparison. Note that as the predictor order p increases,
the approximation to the true ARMA psd gets better. Note also that this estimate appears
to have better resolution than the Bartlett estimator in that it can better match the narrow
spectral peaks of the true psd.

Figure 11.6-6 The AR2 spectral estimate for the ARMA model (AR2 estimate and true ARMA psd).

Figure 11.6-7 The AR4 spectral estimate for ARMA model (AR4 estimate and true ARMA psd).


Maximum Entropy Spectral Density 


The maximum entropy spectral density is not really a spectral estimate at all! It is simply 
the psd that has maximum entropy while agreeing with a certain number of given correlation
values. Entropy is a concept borrowed from information theory† that measures the
uncertainty in a random quantity. This uncertainty manifests itself in relatively flat spectral 
densities with a minimum of narrow peaks. Thus, we can think of the maximum entropy psd 
as the flattest one that agrees with the measured correlations. Certainly, this is a conser- 
vative choice, if we are looking for high resolution of closely spaced spectral components. 
The surprising result of such an approach, however, is that the maximum entropy psd has 
much higher resolution for closely spaced spectral peaks than do the classical methods. 
The resolution of this seeming paradox is that the classical techniques are not correlation 
matching. The maximum entropy psd is the flattest spectrum that satisfies the constraint 
of matching the known (exact) correlation data. 

It turns out that the Gaussian random sequence has the largest entropy of all random
sequences with a given correlation function, so we can restrict attention to Gaussian random
sequences. For such a sequence X[n] with psd $S_{XX}(\omega)$, the continuous entropy can be
expressed as

$$H(X) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log S_{XX}(\omega)\, d\omega.$$

†Actually, in information theory, this quantity is called continuous entropy. Simple entropy would be
infinite in this case of continuous random variables.


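We wish to maximize this quantity subject to the constraint that the inverse Fourier
transform of $S_{XX}(\omega)$ agree with the correlation values $R_{XX}[m] = r_m$, for $m = -p, \ldots, +p$.
We can write the resulting constrained maximization problem in Lagrange multiplier form
by introducing the variables $\lambda_{-p}, \ldots, \lambda_{+p}$ as follows:

$$\Lambda \triangleq \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log S_{XX}(\omega)\, d\omega - \sum_{m=-p}^{+p} \lambda_m r_m.$$

Consider the partial derivatives of $\Lambda$ with respect to the $r_m$,

$$\frac{\partial \Lambda}{\partial r_m} = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \frac{1}{S_{XX}(\omega)} \exp(-j\omega m)\, d\omega - \lambda_m.$$

Upon setting these partial derivatives to zero, with $\lambda_m$ taken as zero for $|m| > p$ where the
correlation values are unconstrained, we obtain the result that the
inverse Fourier transform of $S_{XX}^{-1}(\omega)$ equals the FIR sequence $\lambda_{-p}, \ldots, \lambda_0, \ldots, \lambda_p$, that is,

$$IFT\{S_{XX}^{-1}(\omega)\} = \begin{cases} \lambda_m, & |m| \le p, \\ 0, & |m| > p. \end{cases}$$

Then, by taking the Fourier transform of the 2p + 1 point sequence $\lambda_m$, and solving for
$S_{XX}(\omega)$, we can obtain

$$S_{XX}(\omega) = \frac{1}{\sum_{m=-p}^{+p} \lambda_m \exp(-j\omega m)}, \quad |\omega| \le \pi.$$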

Incidentally, we assume that $\lambda_m = \lambda_{-m}^*$, as required to make $S_{XX}(\omega)$ real-valued. Finally
we must solve for the $\lambda_m$ to match the known correlation values $r_m$. To see when this
is possible, remember that the Markov-p random sequence has an all-pole power spectral
density of this same form. Further the coefficients were related to the correlation values by
the so-called Normal or Yule-Walker Equation 11.1-10; see also Equation 11.2-5. Thus, by
writing

$$\sum_{m=-p}^{+p} \lambda_m \exp(-j\omega m) = \frac{1}{\sigma^2} \left|1 - \sum_{m=1}^{p} a_m \exp(-j\omega m)\right|^2,$$

we can find the required coefficients $a_m$ and $\sigma^2$ as the solution to the equations $\mathbf{R}_p \mathbf{a} = \mathbf{r}$,
written explicitly as:

$$\begin{bmatrix} r_0 & r_1 & r_2 & \cdots & r_{p-1} \\ r_{-1} & r_0 & r_1 & \cdots & r_{p-2} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ r_{-(p-2)} & \cdots & r_{-1} & r_0 & r_1 \\ r_{-(p-1)} & \cdots & r_{-2} & r_{-1} & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_{p-1} \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_{p-1} \\ r_p \end{bmatrix}.$$

The solution,

$$\mathbf{a} = \mathbf{R}_p^{-1} \mathbf{r},$$

is guaranteed to exist whenever the correlation matrix $\mathbf{R}_p$ is positive definite.

We see that, practically speaking, the maximum entropy psd is just the AR spectral 
estimate that would be generated if the correlation (or covariance) estimated values were 
the exact values. The following example shows how the maximum entropy spectral estimate 
varies with assumed order p. It is very similar to the previous example on AR spectral 
estimates, but here the true values are used for the correlation (or covariance) values needed. 


Example 11.6-4 
(effect of model order on ME estimation of the psd) The purpose of this example is to
investigate the effect of increasing assumed model order p on the maximum entropy spectral
density estimate. For each p, we denote the resulting spectral density as MEM(p). This
example was computed using the four MATLAB program files mem_p.m, for p = 1 through
4, located at this book’s Web site. We use the AR(3) model with parameters indicated by
the MATLAB vector a = [1. −1.7 1.53 −.648]. You can run these four .m files yourself to
experiment with other AR and ARMA correlation models. We determined the following
values for various orders:

p = 1: a = (0.7026)
p = 2: a = (1.2215, −0.7385)
p = 3: a = (1.700, −1.53, 0.648)
p = 4: a = (1.700, −1.53, 0.648, 0.00)

Note that starting with MEM(3), the $a_k$ values are numerically correct. Figure 11.6-8 shows
the corresponding psd plots for p = 2 and 3. We see the MEM(2) psd only detects the
larger peak at $\omega = \pi/3$, while the MEM(3) psd of Figure 11.6-8 just overplots the true
AR(3) spectral density at this plotting resolution. Thus, in this example at least, there is no
numerical instability associated with choosing too high a model order. In real cases, and
where the correlation data is not so exact, there are practical problems encountered when
the model order is chosen too high for the data.

Figure 11.6-8 True (dashed) and MEM2 (solid) psd’s for AR(3) model.
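The dependence of the MEM(p) coefficients on the assumed order can be reproduced with the Levinson-Durbin recursion, which solves $\mathbf{R}_m \mathbf{a} = \mathbf{r}$ for every order m = 1, ..., p in one pass. The sketch below is not the book's mem_p.m files. The function names are hypothetical, the sign convention is $X[n] = \sum_k a_k X[n-k] + W[n]$, and the exact correlation values are approximated here by a long truncated sum over the filter's impulse response, an approach assumed for simplicity.

```python
def ar_correlations(a, max_lag, terms=2000):
    """Approximate r[m] = sum_n h[n] h[n+m] for unit-variance white noise
    driving the AR model X[n] = sum_k a[k] X[n-1-k] + W[n] (truncated sum)."""
    p = len(a)
    h = []
    for n in range(terms + max_lag):
        v = 1.0 if n == 0 else 0.0
        v += sum(a[k] * h[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        h.append(v)
    return [sum(h[n] * h[n + m] for n in range(terms)) for m in range(max_lag + 1)]


def levinson(r, p):
    """Return the coefficient vectors of orders 1..p solving R_m a = r."""
    models, a, err = [], [], r[0]
    for m in range(1, p + 1):
        # reflection coefficient for order m
        k = (r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))) / err
        # order update of the lower-order coefficients, then append k
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1 - k * k          # prediction-error variance update
        models.append(a)
    return models


a_true = [1.7, -1.53, 0.648]      # AR(3) model of Example 11.6-4
r = ar_correlations(a_true, 4)
models = levinson(r, 4)
for p, a in enumerate(models, start=1):
    print("MEM(%d):" % p, ["%.4f" % c for c in a])
```

From order 3 onward the recursion should return the true coefficients, with the extra fourth coefficient numerically zero, in line with the observation that choosing too high an order causes no difficulty when the correlation values are exact.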


11.7 SIMULATED ANNEALING 


Simulated annealing is a powerful stochastic technique for optimization. It can be applied to 
both deterministic and random optimization problems, seeking either maxima or minima. 
Here we look at its use for the problem of maximum a posteriori† probability (MAP)
estimation for Markov random sequences, where the goal is to maximize the conditional 
probability of the signal given noisy observations. In general, such optimization problems 
have many local maxima so that a simple hill climbing algorithm like steepest ascent will not 
be effective. Simply speaking, simulated annealing (SA) avoids ending at a local maximum, 
by following an iterative stochastic procedure that samples the a posteriori pdf (or PMF) at 
a huge number of points, near both local maxima and the global maximum. As the iteration 
proceeds, a parameter termed temperature which governs the randomness in the a posteriori 
pdf is slowly reduced. This causes the estimator to spend more and more of its time near 
the global maximum. If the temperature is reduced toward zero slowly enough, the MAP 
estimate is obtained. 

SA is a stochastic procedure because the steps are random and are obtained as samples 
of the conditional pdf at each time (or spatial location). SA is not needed for the Gaussian 
Markov random sequences studied in earlier sections, where Wiener and Kalman filters yield 
the MMSE estimates very efficiently. In the Gaussian case, the MMSE and MAP estimates 
are the same since the MMSE estimate is the conditional mean of the a posteriori pdf, 
and being Gaussian this conditional mean is the peak. For more complicated, especially 
compound (sometimes called doubly stochastic) Markov models, the solution is non linear 
and usually cannot be obtained in closed form, so that some form of iterative procedure is 
then required to find either the MMSE or the MAP estimate. 

Physical annealing describes the process of gradually cooling a liquid in such a way
that large scale crystal-like regions form as the temperature is reduced, that is, the material
freezes. The freezing occurs at a critical temperature called the Curie point. A similar effect
is observed for so-called ferromagnetic material when placed in a strong magnetic field. In
this material, the internal magnetic dipoles align with an imposed magnetic field, more
and more, as the temperature is gradually reduced, and the material becomes magnetized.
The Ising model [11-20] is a simple Markov chain model that has been able to capture
the essential freezing characteristics of magnetization in ferromagnetic solids. SA is an
optimization method that has, because of some similarities, adopted the terminology of
physical annealing. Another name for SA is stochastic relaxation. It is a modern and efficient
way of doing Monte Carlo simulation. In SA the variances in both the random signal model
and observation noise are multiplied by a temperature parameter T, and then this parameter
is slowly reduced to simulate annealing.

†The Latin phrase a posteriori simply means “afterwards,” in this case after the noisy observations.
The complementary term a priori means “before.” The a priori pdf would then be the original pdf of the
noise-free signal.

We will apply SA to the problem of finding the MAP estimate for Markov random 
sequences. However, its main application area has been in image processing, to two- 
dimensional random sequences called random fields. In particular, Markov random fields 
often exhibit two properties that require SA. First they are often noncausal, so that the 
Kalman recursive-in-time scheme cannot be used to get the MMSE estimate even if the data 
is Gaussian. Secondly, the Markov random fields of interest in image processing, are often 
compound with resultant nonGaussianity, so that the MMSE estimator is not linear, ruling 
out a simple two-dimensional extension of the Wiener filter. On the other hand, extension 
of the one-dimensional SA method for noncausal random sequences, to such spatial and 
even spatiotemporal processes is straightforward. 


Gibbs Sampler 


The Gibbs sampler [11-21] is a stochastic iterative procedure which samples the conditional
pdf (or PMF) of the signal given both the noisy observations and neighboring, but
prior signal estimates. Assume there are noisy observations of the N-dimensional vector
$\mathbf{X} \triangleq (X[0], X[1], \ldots, X[N-1])^T$ available in the observation vector
$\mathbf{Y} \triangleq (Y[0], Y[1], \ldots, Y[N-1])^T$. We define the MAP estimate of $\mathbf{X}$ then as

$$\hat{\mathbf{X}}_{MAP}(\mathbf{Y}) \triangleq \arg\max_{\mathbf{x}} f(\mathbf{x} \mid \mathbf{Y}).$$

The Gibbs sampler does not sample the a posteriori pdf $f(\mathbf{x} \mid \mathbf{Y})$ directly, as this would
be difficult for large N. Rather it samples the related conditional pdf $f(x[n] \mid \mathbf{x}_n, \mathbf{Y})$, where
$\mathbf{x}_n \triangleq (x[0], \ldots, x[n-1], x[n+1], \ldots, x[N-1])$ is the so-called deleted vector obtained
by removing x[n]. The sampling is usually implemented in sequence, sweeping through the
N lattice points (time or space) to complete one step of the iteration. Repeated sweeps
with the Gibbs sampler thus generate a sequence of estimates $\hat{\mathbf{X}}[k]$ for increasing iteration
number k. Under certain conditions this sequence can be shown to converge to the
MAP estimate $\hat{\mathbf{X}}_{MAP}$. In fact, the sequence of vector estimates $\hat{\mathbf{X}}[k]$ can be shown to be a
Markov random sequence itself [11-21]. As part of the Gibbs sampling procedure, variances
are multiplied by the parameter T, called the temperature, which gradually reduces the
variance of this conditional density as $T \to 0$, finally resulting in a freezing at hopefully the
global optimum value of the objective function. We denote the modified conditional pdf as
$f_T$. Since the variance of the distribution is decreasing with temperature, the local maxima
of $f_T(\mathbf{x} \mid \mathbf{Y})$ are decreasing with respect to the global maximum $f_T(\hat{\mathbf{X}}_{MAP} \mid \mathbf{Y})$. This means
that near T = 0, the conditional pdf should be nearly an impulse centered at $\mathbf{x} = \hat{\mathbf{X}}_{MAP}(\mathbf{Y})$.
Here is the basic algorithm for the Gibbs sampler.


1. Set iteration number k = 1 and temperature T[k] = 1, max number of iterations $k_f$,
and initialize the signal estimate $\hat{\mathbf{X}}[0]$.

2. Using temperature T[k], at each site n, sample the conditional pdf $f_T(x[n] \mid \mathbf{x}_n, \mathbf{Y})$,
where $\mathbf{x}_n$ is the vector of samples of estimates computed up to this time. After we
complete this sweep of all N samples, call the new signal estimate vector $\hat{\mathbf{X}}[k]$.

3. Set k = k + 1, reduce the temperature T[k] according to the annealing schedule, and go
to step 2 until $k > k_f$ or convergence criteria are met.


Below, we will see how to obtain the conditional pdf $f_T(x[n] \mid \mathbf{x}_n, \mathbf{Y})$ from which we
obtain the new signal estimate $\hat{\mathbf{X}}[k]$ from the prior estimate $\hat{\mathbf{X}}[k-1]$ in the important
case of compound Gauss-Markov random sequence. The rate at which the temperature
T[k] is reduced over time is crucial and is called the annealing schedule. Proofs of conver- 
gence of SA algorithms [11-21], [11-22] depend on how slowly the temperature reduces. 
Unfortunately, practical applications usually require a faster reduction than the logarithmic 
annealing schedule appearing in key proofs. Still simulated annealing with the Gibbs sampler 
often results in much improved estimates in practice for compound Gauss—Markov models. 
A concrete example of how this is done appears below after introduction of noncausal 
Gauss—Markov random sequences. 
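The three steps of the Gibbs sampler can be sketched on a deliberately simple model. Everything specific in this sketch is an assumption made for illustration and is not the compound model treated in the text: a first-order noncausal Gauss-Markov signal $X[n] = c\,(X[n-1] + X[n+1]) + U[n]$, observations $Y[n] = X[n] + V[n]$ with white Gaussian noise V[n], zero boundary values outside [0, N − 1], and a geometric cooling schedule (much faster than the logarithmic schedule of the convergence proofs). For such a purely Gaussian model SA is not actually needed, so the example only shows the mechanics of sweeping and cooling.

```python
import random


def gibbs_anneal(y, c, sigma_u2, sigma_v2, sweeps=200, cooling=0.9, seed=0):
    """Simulated annealing with a Gibbs sampler for a toy Gauss-Markov model.

    y        : list of noisy observations
    c        : interpolator coefficient of the assumed first-order model
    sigma_u2 : interpolation-error variance; sigma_v2 : observation-noise variance
    """
    rng = random.Random(seed)
    N = len(y)
    x = list(y)        # step 1: initialize the estimate (here with the data)
    T = 1.0            # step 1: initial temperature
    # Conditional variance of x[n] given its neighbors and y[n] at T = 1
    post_var = 1.0 / (1.0 / sigma_u2 + 1.0 / sigma_v2)
    for _ in range(sweeps):
        for n in range(N):                      # step 2: one sweep over all sites
            left = x[n - 1] if n > 0 else 0.0   # assumed zero boundary values
            right = x[n + 1] if n < N - 1 else 0.0
            prior_mean = c * (left + right)
            mean = post_var * (prior_mean / sigma_u2 + y[n] / sigma_v2)
            # Multiplying both variances by T scales the conditional variance by T
            x[n] = rng.gauss(mean, (T * post_var) ** 0.5)
        T *= cooling                            # step 3: reduce the temperature
    return x


y = [1.0, 2.0, 0.5, -1.0, 0.0, 1.5]             # made-up observations
x_hat = gibbs_anneal(y, c=0.4, sigma_u2=1.0, sigma_v2=1.0)
print(["%.3f" % v for v in x_hat])
```

As the temperature approaches zero, each sample collapses onto its conditional mean, so the final estimate should nearly satisfy the fixed-point condition that every site equals the conditional mean given its neighbors and its observation.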


Noncausal Gauss—Markov Models 


In earlier work in this book, we have looked at the Gauss—-Markov model and seen that it is 
modeled by white Gaussian noise input to an all-pole filter, so that the resulting difference 
equation, in the WSS case, is 


X[n| = ~ ayX(n — k] + W[nl, —oo <n < +00, (11.7-1) 
k=1 


with power spectral density Sx x(w) = of /|1—VR_, axe J**|?. Here W[n] is the instanta- 
neous prediction error that has the MMSE property. This is a causal model, meaning that 
the sequence X[n] can be recursively generated with increasing n, from some starting point 
(here assumed at —oo). We can generate the same random sequence from the noncausal 
linear model? 


Pp 
X[n}= Seg X[n—k] +U[n], — -00 < n< +00, (11.7-2) 
k=—p, k£0 


where the c,’s are the linear interpolator coefficients and the random sequence U[n] is the 
interpolation error that has the MMSE property. U[n] has a neighborhood limited (i.e., 
finite support) correlation function 


a, m=0, 
Ryu|m] = § —cmozZ, 0<|m| <p, (11.7-3) 
0, else. 


†This is the same representation for Markov random sequences that was found in Section 11.6 for the
Maximum Entropy spectrum.

We can derive Equation 11.7-2 by multiplying out the denominator of $S_{XX}(\omega)$ above to
get the noncausal model and the $c_k$ values. We find explicitly that

$$S_{XX}(\omega) = \frac{\sigma_W^2}{\left|1 - \sum_{k=1}^{p} a_k e^{-j\omega k}\right|^2} = \frac{\sigma_U^2}{1 - \sum_{k=-p,\ k\neq 0}^{p} c_k e^{-j\omega k}}. \tag{11.7-4}$$

Correlation models of the form in Equation 11.7-3 have finite support, so that beyond a
distance $p$, there is exactly zero correlation. Thus, unlike the input $W[n]$ in a causal Markov
model, the input $U[n]$ in a noncausal Markov random sequence is not a white noise.

The reader may wonder why we introduce equivalent, but noncausal, models. The reason
is so that we can demonstrate SA, in particular the Gibbs sampler, without going to two
dimensions. In two and higher dimensions, the noncausal model equation does not factor
simply as in Equation 11.7-4. It also turns out that the noncausal model is the more general
of the two, and the basis for defining the Markov random field in two and higher dimensions.

The general definition of Markov random sequences (and fields) is given in terms of
their interpolative conditional pdf $f_X(x_n \mid x_{n+p}, \ldots, x_{n+1}, x_{n-1}, \ldots, x_{n-p})$.
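Returning to Equation 11.7-4, the "multiplying out" step can be sketched numerically. The helper names below (`causal_to_noncausal`, `psd_causal`, `psd_noncausal`) are hypothetical. The sketch assumes real coefficients, expands the squared magnitude in the causal denominator, and normalizes by its zero-lag term to read off the interpolator coefficients and the interpolation error variance.

```python
import cmath

def causal_to_noncausal(a, sigma_w2):
    """Return ({k: c_k}, sigma_u2) for AR coefficients a = [a_1, ..., a_p]."""
    h = [1.0] + [-ak for ak in a]      # coefficients of 1 - sum_k a_k z^{-k}
    p = len(a)
    # r[k] = sum_i h[i] h[i+k]: expansion coefficients of |1 - sum_k a_k e^{-jwk}|^2
    r = [sum(h[i] * h[i + k] for i in range(p + 1 - k)) for k in range(p + 1)]
    sigma_u2 = sigma_w2 / r[0]
    c = {k: -r[abs(k)] / r[0] for k in range(-p, p + 1) if k != 0}
    return c, sigma_u2

def psd_causal(a, sigma_w2, w):
    """Left-hand form of Equation 11.7-4."""
    den = 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return sigma_w2 / abs(den) ** 2

def psd_noncausal(c, sigma_u2, w):
    """Right-hand form of Equation 11.7-4."""
    den = 1.0 - sum(ck * cmath.exp(-1j * w * k) for k, ck in c.items())
    return sigma_u2 / den.real          # imaginary part cancels since c_k = c_{-k}
```

Evaluating both forms at a few frequencies for the same model is a quick check that the two representations describe the same spectrum.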


Definition 11.7-1 A random sequence $X[n]$ defined on $[0, N-1]$ is said to be Markov-$p$
if, for all $n$,

$$f_X(x_n \mid x_0, \ldots, x_{n-1}, x_{n+1}, \ldots, x_{N-1}) = f_X(x_n \mid x_{n+p}, \ldots, x_{n+1}, x_{n-1}, \ldots, x_{n-p}).$$

If $n \pm p$ are outside $[0, N-1]$, then use boundary conditions or reduce the model order near
these boundaries.


We will use this general definition of Markov random sequence in the sequel. In the 
two-dimensional, finite lattice case, where factorization of the noncausal representation is 
not possible, there was the problem of how to relate this conditional representation to the 
joint pdf fx(x), where X is the random vector of all the sample random variables X [n]. 
In [11-23], Besag showed how to relate the two representations, that is, conditional and 
joint probabilities, in a theorem of Hammersley and Clifford. For the discrete-valued case 
on a finite lattice (or bounded region), the Hammersley-Clifford theorem expresses the joint
Markov PMF $P_X(\mathbf{x})$ in terms of an energy function $U(\mathbf{x})$ (not related to $U[n]$ above!) as

$$P_X(\mathbf{x}) = k \exp\{-U(\mathbf{x})\},$$

with the assumption that $P_X(\mathbf{x}) > 0$ for all $\mathbf{x}$, and with $k$ being a normalizing constant. The
non-negative energy function $U(\mathbf{x})$ is in turn defined in terms of potential functions that are
summed over cliques, which are single sites and neighbor pairs in the local neighborhood
regions of the Markov random sequence,

$$U(\mathbf{x}) = \sum_{c \in C} V_c(\mathbf{x}).$$

Here the sum is over each site or location $n$, and the potentials associated with that site.
This general Markov random sequence representation also applies for continuous-valued
random sequences on finite lattices, again with the constraint that $f_X(\mathbf{x}) > 0$ for all $\mathbf{x}$.
Note that the energy function is a sum over local characteristics expressed in terms of
the potential functions. We see that the maximum probability occurs at the minimum
energy.


Example 11.7-1
(cliques and energy functions) For the noncausal Gauss-Markov random sequence of order
$p$ given in Equation 11.7-2, the cliques $c$ are the single sites $[n]$ and all the site pairs $[n], [k]$
with $|n - k| \le p$, with corresponding potential functions

$$V_c(\mathbf{x}) = \frac{x^2[n]}{2\sigma_U^2} \ \text{ for } k = n, \quad \text{and} \quad V_c(\mathbf{x}) = -\frac{c_{n-k}\, x[n]\, x[k]}{\sigma_U^2} \ \text{ for } k \neq n \text{ and } |n - k| \le p.$$

Here $\sigma_U^2$ is the minimum mean-square interpolation error that is obtained with interpolation
coefficients $c_k$. The overall energy $U(\mathbf{x})$ is then obtained by summing the above potential
functions over all $n$ and, for each $n$, over all $k \neq n$ in a local neighborhood such that
$|n - k| \le p$. In summing over clique pairs $[n],[k]$, we multiply by 1/2 so as not to count
these site pairs twice. We thus get

$$U(\mathbf{x}) = \sum_{n=0}^{N-1} \frac{x^2[n]}{2\sigma_U^2} - \sum_{n=0}^{N-1} \ \sum_{k=n-p,\ k\neq n}^{n+p} \frac{c_{n-k}\, x[n]\, x[k]}{2\sigma_U^2},$$

which we recognize as the exponential argument of a multidimensional Gaussian random
vector of Chapter 5. In this case the $N \times N$ correlation matrix $\mathbf{R}_{XX}$ is given in inverse
form, for the first-order ($p = 1$) and $N = 6$ case for example, as

$$\mathbf{R}_{XX}^{-1} = \frac{1}{\sigma_U^2}\begin{bmatrix} 1 & -c_1 & 0 & 0 & 0 & 0 \\ -c_1 & 1 & -c_1 & 0 & 0 & 0 \\ 0 & -c_1 & 1 & -c_1 & 0 & 0 \\ 0 & 0 & -c_1 & 1 & -c_1 & 0 \\ 0 & 0 & 0 & -c_1 & 1 & -c_1 \\ 0 & 0 & 0 & 0 & -c_1 & 1 \end{bmatrix},$$

where $c_1 = c_{-1}$. We note that this implies something about the boundary conditions; in particular, the
boundary condition $x[-1] = x[N] = 0$ is needed here. The fact that different boundary
conditions are needed for the causal and noncausal Markov random sequences has been
pointed out by Derin in [11-20].
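The link between the clique sum and the Gaussian quadratic form can be checked with a small sketch for the first-order case. It assumes the zero boundary condition $x[-1] = x[N] = 0$ and a symmetric coefficient $c_1 = c_{-1}$; the function names `energy` and `quad_form` are illustrative only.

```python
def energy(x, c1, sigma_u2):
    """U(x) for the first-order (p = 1) noncausal model with x[-1] = x[N] = 0."""
    n_pts = len(x)
    single = sum(v * v for v in x) / (2.0 * sigma_u2)
    # Each neighbor pair appears twice in the double sum, hence the 1/2
    # already folded into the 2*sigma_u2 denominator.
    pairs = 0.0
    for n in range(n_pts):
        for k in (n - 1, n + 1):
            if 0 <= k < n_pts:
                pairs += c1 * x[n] * x[k] / (2.0 * sigma_u2)
    return single - pairs

def quad_form(x, c1, sigma_u2):
    """0.5 * x^T R^{-1} x with R^{-1} = (1/sigma_u2) * tridiag(-c1, 1, -c1)."""
    n_pts = len(x)
    total = 0.0
    for i in range(n_pts):
        for j in range(n_pts):
            rinv = 1.0 if i == j else (-c1 if abs(i - j) == 1 else 0.0)
            total += x[i] * rinv * x[j]
    return 0.5 * total / sigma_u2
```

Both functions return the same number for any input vector, which is the sense in which the energy is the exponent of a Gaussian pdf with a tridiagonal inverse correlation matrix.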


The next example uses the Gibbs sampler to perform an iterative MAP estimate for a
Gauss-Markov random sequence. While this problem could be more efficiently solved by a
linear Kalman or Wiener filter, the example serves to illustrate the SA approach. This same
approach is then used in a later example for a compound random process, for which linear
estimates are not optimal.

Example 11.7-2
(interpolative (conditional) representation) Consider the following estimation problem. For
some $N > 1$, we have observations

$$Y[n] = X[n] + V[n] \quad \text{for } n = 0, \ldots, N-1,$$

where the signal vector $\mathbf{X} \triangleq (X[0], X[1], \ldots, X[N-1])^T$ is noncausal Gauss-Markov, and
the additive noise $V[n]$ is independent white Gaussian noise of variance $\sigma_V^2$. We seek the
MAP estimate $\hat{\mathbf{X}}_{MAP} = \arg\max_{\mathbf{x}} f(\mathbf{x} \mid \mathbf{Y})$ using the Gibbs sampler, where
$\mathbf{Y} = (Y[0], Y[1], \ldots, Y[N-1])^T$.

To run the Gibbs sampler and implement the simulated annealing schedule, we modify
the variances $\sigma_U^2 \to T\sigma_U^2$ and $\sigma_V^2 \to T\sigma_V^2$. With $T = 1$, we have the true variances, but as
$T \to 0$, we simulate annealing to converge to the global MAP estimate $\hat{\mathbf{X}}_{MAP}(\mathbf{Y})$. We need
the conditional pdf $\pi_X[n] \triangleq f_T(x[n] \mid \mathbf{x}_n, \mathbf{Y})$, where $\mathbf{x}_n$ is the deleted version of $\mathbf{x}$, and all
the variances are multiplied by $T$. Starting out, we write


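$$\begin{aligned}
\pi_X[n] &= f_T(x[n] \mid \mathbf{x}_n, \mathbf{Y}) \\
&= \frac{f_T(x[n], \mathbf{x}_n, \mathbf{Y})}{\int f_T(z, \mathbf{x}_n, \mathbf{Y})\, dz}, \quad \text{where the integration is over all values of } z = x[n], \\
&= \frac{f_T(\mathbf{x})\, f_T(\mathbf{Y} \mid \mathbf{x})}{\int f_T(\mathbf{x})\, f_T(\mathbf{Y} \mid \mathbf{x})\, dz} \\
&= \frac{f_T(x[n] \mid \mathbf{x}_n)\, f_T(\mathbf{x}_n)\, f_T(\mathbf{Y} \mid \mathbf{x})}{\int f_T(z \mid \mathbf{x}_n)\, f_T(\mathbf{x}_n)\, f_T(\mathbf{Y} \mid \mathbf{x})\, dz} \\
&= k\, f_T(x[n] \mid \mathbf{x}_n)\, f_T(\mathbf{Y} \mid \mathbf{x}) \\
&= k \exp\left(-\frac{(x[n] - (c * x)[n])^2}{2T\sigma_U^2} - \frac{(Y[n] - x[n])^2}{2T\sigma_V^2}\right),
\end{aligned} \tag{11.7-5}$$

where many terms not involving $x[n]$ have been absorbed into the normalizing constant $k$,
and where the term†

$$(c * x)[n] \triangleq \sum_{k=-p,\ k\neq 0}^{p} c_k x[n-k].$$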
Now recognizing that this is a Gaussian pdf, and completing the square, we finally arrive
at a Gaussian sampling pdf $\pi_X[n]: N(\mu, \sigma^2)$ with

$$\mu = (c * x)[n]\, \frac{\sigma_V^2}{\sigma_U^2 + \sigma_V^2} + Y[n]\, \frac{\sigma_U^2}{\sigma_U^2 + \sigma_V^2} \quad \text{and} \quad \sigma^2 = T\, \frac{\sigma_U^2 \sigma_V^2}{\sigma_U^2 + \sigma_V^2}.$$

†We note that for a given set of realizations $\mathbf{X}_n = \mathbf{x}_n$ (recall that $\mathbf{X}_n \triangleq (X[n-p], \ldots, X[n-1], X[n+1], \ldots, X[n+p])^T$
and $\mathbf{x}_n = (x[n-p], \ldots, x[n-1], x[n+1], \ldots, x[n+p])^T$) we have from Equation 11.7-2
that $E\{X[n] \mid \mathbf{x}_n\} = \sum_{k=-p,\ k\neq 0}^{p} c_k x[n-k]$ since $E\{U[n]\} = 0$, that is, $R_{UU}[m] = \mu_U^2 = 0$ for $|m| > p$.

We can recognize the conditional mean $\mu$ as a weighted average of the observed value of $Y[n]$
and the interpolated value, based on the current estimate of the neighborhood of site $n$.
Similarly, $\sigma^2$ is the variance of the error when $\mu$ is used as a summary (estimator) of this
distribution.
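The sampling step of this example can be sketched as follows for the first-order case. This is a minimal illustration, not the authors' program: the function name `gibbs_sweep`, the raster-scan site order, the zero boundary values, the made-up observations, and the geometric temperature schedule are all assumptions.

```python
import math
import random

def gibbs_sweep(x, y, c1, su2, sv2, temp, rng):
    """One raster-scan sweep: resample each x[n] from N(mu, sigma^2), p = 1."""
    n_pts = len(x)
    var = temp * su2 * sv2 / (su2 + sv2)             # sigma^2 scaled by T
    for n in range(n_pts):
        left = x[n - 1] if n > 0 else 0.0            # assumed boundary x[-1] = 0
        right = x[n + 1] if n < n_pts - 1 else 0.0   # assumed boundary x[N] = 0
        interp = c1 * (left + right)                 # (c * x)[n] with c_1 = c_{-1}
        mu = (interp * sv2 + y[n] * su2) / (su2 + sv2)
        x[n] = rng.gauss(mu, math.sqrt(var))
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    y = [1.0, 1.1, 0.9, 1.0, -1.0, -1.1, -0.9, -1.0]   # made-up observations
    x = [0.0] * len(y)
    for k in range(200):
        temp = 0.97 ** k    # assumed geometric schedule, faster than logarithmic
        gibbs_sweep(x, y, c1=0.45, su2=0.1, sv2=0.05, temp=temp, rng=rng)
    print([round(v, 3) for v in x])
```

As the temperature falls, the sampling variance shrinks and each update approaches the deterministic weighted average given by the conditional mean.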


While this is an interesting exercise, as we have said before, SA is not needed for
one-dimensional Gauss-Markov random sequences, because the MMSE and MAP estimates
can be calculated recursively by the Kalman filter, as we saw in Section 11.2. Since we cannot
factor general noncausal Markov models in two and higher dimensions, an iterative solution
is often appropriate there. However, since there is only one maximum in the a posteriori pdf in the
Gauss-Markov case, a deterministic version of simulated annealing can be used [11-20], or
an iterative Wiener filter (cf. Section 11.3) can be used. The real importance of SA
lies in finding the MAP estimate for compound stochastic models, with their characteristic
of many local maxima of the a posteriori pdf. Below, we consider the case of compound
Markov random sequences that have been used with success in image processing. To simplify
the discussion, we will continue to treat only the one-dimensional case. Generalization of
the SA method to two and higher dimensions is straightforward.


Compound Markov Models 


Returning to Equation 11.7-2, we now write the coefficients as randomly selected by some
underlying Markov chain $L[n]$, called a line sequence (also commonly called a line process):

$$X[n] = \sum_{k=-p,\ k\neq 0}^{p} c_k^{\mathbf{L}[n]} X[n-k] + U_{\mathbf{L}[n]}[n], \qquad 1 \le n \le N. \tag{11.7-6}$$

Here the line process $L[n]$ is interpreted as “bonds” between the data values $X[n]$, and
is located between the data points, in the time (or spatial) domain. If $L = 0$, the bond
is present and there is normal (high) correlation between the neighboring values on either
side of the bond, but if $L = 1$, then the bond is broken and little correlation exists. In the
interpolative model of Equation 11.7-6, the interpolation coefficients $c_k^{\mathbf{L}[n]}$ are either big or
small, based on the value of $\mathbf{L}[n] \triangleq (L[n], L[n+1])$, with temporal (spatial) arrangement as
shown in Figure 11.7-1. In this figure, the black dots indicate the locations (sites) for the
observed random variables $X[n]$, while the vertical bars indicate the interstitial locations
or sites of the unobserved bonds or line variables $L[n]$. We note that the binary random
vector $\mathbf{L}[n]$ contains the two bonds on either side of $X[n]$, so that the value of the $c_k^{\mathbf{L}[n]}$ will
directly depend on whether these bonds are intact. Note also that the interpolation error
$U_{\mathbf{L}[n]}[n]$ would be expected to be large or small, based on these local bonds. For example,
if all the $L[n] = 1$, then the interpolation error will be larger since there is no “bonding”
between points. We model this error here as a conditionally Gaussian pdf, with zero mean,
and a variance dependent on the line process nearest neighbors $\mathbf{L}[n]$.

Figure 11.7-1 Placement of line process with regard to data samples.


It is convenient to consider the joint MAP estimate here

$$(\hat{\mathbf{X}}, \hat{\mathbf{L}}) \triangleq \arg\max_{\mathbf{x},\, \mathbf{l}} f(\mathbf{x}, \mathbf{l} \mid \mathbf{Y}).$$

As in the previous example, we consider only the first-order case here, where the only
coefficients are $c_1^{\mathbf{l}} = c_{-1}^{\mathbf{l}}$. We can write the joint mixed† pdf $f(\mathbf{x}, \mathbf{l}, \mathbf{Y}) = P(\mathbf{l}) f(\mathbf{x} \mid \mathbf{l}) f(\mathbf{Y} \mid \mathbf{x}, \mathbf{l})$,
and then we have

$$f(\mathbf{x}, \mathbf{l} \mid \mathbf{Y}) = k P(\mathbf{l}) f(\mathbf{x} \mid \mathbf{l}) f(\mathbf{Y} \mid \mathbf{x}, \mathbf{l}),$$

where the normalization constant‡ $k$ is not a function of $\mathbf{x}$ or $\mathbf{l}$. To proceed further we need
the PMF of the discrete-valued line sequence $\mathbf{L}$, for which we take a one-dimensional version
of the Gemans' line sequence [11-21], as shown below.


Gibbs Line Sequence 


We write the joint probability of the line sequence $\mathbf{L} = \{L[n]\}$ as

$$P(\mathbf{l}) = k \exp(-U(\mathbf{l})),$$

with energy function

$$U(\mathbf{l}) \triangleq \sum_{c_l \in C_l} V_{c_l}(\mathbf{l}),$$

where $V_{c_l}(\mathbf{l})$ is the potential function at clique $c_l$. We consider four such values illustrated
by Figure 11.7-2, where the black dots are data value sites, and the line sequence is shown
between the samples. Four separate cases are shown here, with increasing energy (lower
probability) from left to right. Here the black line indicates a broken bond, that is, $l = 1$,
while its absence indicates an intact bond, that is, $l = 0$. Note that the potential function
values $V_{c_l}$ increase with the number of broken bonds, with two neighboring broken bonds
being given the highest potential, leading to highest energy, and hence lowest probability.
The numerical values given were experimentally determined from the data, and have no
general significance. For a new problem, another set of increasing potential function values
would have to be carefully selected.
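To make the role of the potentials concrete, the sketch below evaluates the energy of a short line sequence. It assumes that each clique is a pair of neighboring line sites and uses the four values 0.0, 8.0, 8.0, and 12.0 attached to Figure 11.7-2; the lookup table `V` and the function names are illustrative choices, and which single-break configuration receives which 8.0 does not matter here because the two values are equal.

```python
import math

# Assumed pair potentials: (l[i], l[i+1]) -> V, with 1 meaning a broken bond.
V = {(0, 0): 0.0, (0, 1): 8.0, (1, 0): 8.0, (1, 1): 12.0}

def line_energy(l):
    """U(l) as the sum of pair potentials over neighboring line sites."""
    return sum(V[(l[i], l[i + 1])] for i in range(len(l) - 1))

def prob_ratio(l_a, l_b):
    """P(l_a) / P(l_b); the normalizing constant k cancels in the ratio."""
    return math.exp(-(line_energy(l_a) - line_energy(l_b)))
```

Because only ratios are needed, the normalizing constant never has to be computed, which is what makes site-by-site sampling practical.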


†Note that this probability is part “density” and part “mass function,” hence the word “mixed.” Thus,
it must be integrated over $\mathbf{x}$ and $\mathbf{y}$, but summed over $\mathbf{l}$, to give actual probabilities.

‡Since $f(\mathbf{x}, \mathbf{l} \mid \mathbf{Y}) = f(\mathbf{x}, \mathbf{l}, \mathbf{Y}) / f(\mathbf{Y})$, we find that $k = 1/f(\mathbf{Y})$.


Figure 11.7-2 Assignment of line sequence potential function to line cliques. The four panels are labeled V = 0.0, V = 8.0, V = 8.0, and V = 12.0.


To run the Gibbs sampler for the compound Gauss-Markov sequence, we need the
conditional PMF of the line sequence, as in Equation 11.7-5,

$$\pi_l[n] \triangleq P\{l[n] \mid \mathbf{l}_n, \mathbf{x}, \mathbf{Y}\} = \frac{P\{\mathbf{x}, \mathbf{l}, \mathbf{Y}\}}{\sum_{l[n]} P\{\mathbf{x}, \mathbf{l}, \mathbf{Y}\}},$$

where the deleted vector $\mathbf{l}_n \triangleq (l[0], \ldots, l[n-1], l[n+1], \ldots, l[N-1])^T$ excludes $l[n]$, and
the sum in the denominator is over the (two) values of $l[n]$. After some manipulations, we
come up with the following expression for $\pi_l[n]$ in terms of $V_{c_l}(l[n])$ and the neighboring
data values $x[n]$ and $x[n+1]$, which is valid for first-order models, that is, $p = 1$,


2 ality nijc|n x n 
(=e a ee 1) + Va (inl) 


[n] 7i[n] In+1] 


wk | 


m[n] = kexp 


where we have inserted the annealing temperature variable T, which again is T = 1 for the 
actual stochastic model. 
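Whatever the exact form of the exponent, drawing a binary line variable from its conditional PMF only requires the two candidate energies. The sketch below assumes that the temperature divides the whole exponent and that the caller supplies the exponent evaluated at $l[n] = 0$ and $l[n] = 1$; the function name `sample_bond` and its arguments are hypothetical.

```python
import math
import random

def sample_bond(energy0, energy1, temp, rng):
    """Draw l[n] in {0, 1} with P(l) proportional to exp(-energy_l / temp).

    Returns the drawn value and the probability of a broken bond (l = 1).
    """
    e_min = min(energy0, energy1)          # shift energies for numerical stability
    w0 = math.exp(-(energy0 - e_min) / temp)
    w1 = math.exp(-(energy1 - e_min) / temp)
    p1 = w1 / (w0 + w1)                    # normalization over the two values
    return (1 if rng.random() < p1 else 0), p1
```

At T = 1 the draw follows the model probabilities; as T decreases, the lower-energy state is selected with probability approaching one.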


Example 11.7-3 
(simulated annealing for noisy pulse sequence) Here we apply the Gibbs sampler to the 
estimation of 100 points of a somewhat random pulse sequence X[n] shown in Figure 11.7-3 
corrupted by an independent, additive, white, Gaussian noise W[n]. Figure 11.7-4 shows the 
received noisy pulse sequence. The approximate signal variance is 0.245 and the input noise 
variance is 0.009. We see that the additive noise has significantly increased the random- 
ness of the height of the pulses. After SA processing for 500 iterations the mean square 
error is reduced to approximately 0.004, while after linear iterative processing the MSE is 
approximately 0.0065. 

Figure 11.7-5 shows the SA estimate of L[n] + 1, and we note that it has correctly 
detected the jumps in the input random pulse train. Because the bonds are now broken 
across the pulse edges, we expect the SA estimate of the signal X[n] will be smoothed within 
the pulses, but not across the pulse edges. This should result in improved performance. 

Figure 11.7-6 shows the SA estimate obtained with 500 iterations. Note that the observa- 
tion noise W[n] has been attenuated without smoothing the edges of the pulses. Figure 11.7-7 
shows a linear Wiener estimate obtained without aid from the line sequence. Note the 
oversmoothing or blurring at the pulse edges seen here. 
Figure 11.7-4 Realization of signal plus additive noise Y[n].


In a 2-D version of the signal-in-noise estimation problem of the last example, where
Markov random fields and SA are used for image processing, the use of the compound
Markov model is essential to avoid the otherwise visually annoying blurring of image
edges. Like the noncompound or Gauss-Markov model in the above 1-D example, a simple
2-D Gauss-Markov model would give substantial blurring of image edges and an overall
out-of-focus effect [11-24]. Other applications of SA in image processing include object detection
and motion estimation between two frames of a video or movie.

Figure 11.7-5 SA estimate of L[n].

Figure 11.7-6 SA estimate of signal X[n] after 500 iterations.

Figure 11.7-7 Linear iterative estimate of signal X[n].


SUMMARY 


This chapter has presented several applications of random sequences and processes in the 
area generally known as statistical signal processing. The reader should note that there are 
many more applications that we did not have room to mention. Notable among these is 
the application of probabilistic and statistical theory to communications, data compression, 
and networking. Having completed this book, it is our belief and hope that the student 
will find other important applications of random sequences and processes in his or her 
future work. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading. ) 


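11.1 Let $\mathbf{X}$ and $\mathbf{Y}$ be real-valued random vectors with $E[\mathbf{X}] = E[\mathbf{Y}] = \mathbf{0}$ and $\mathbf{K}_1 \triangleq E[\mathbf{X}\mathbf{X}^T]$,
$\mathbf{K}_2 \triangleq E[\mathbf{Y}\mathbf{Y}^T]$, and $\mathbf{K}_{12} \triangleq E[\mathbf{X}\mathbf{Y}^T] = \mathbf{K}_{21}^T \triangleq (E[\mathbf{Y}\mathbf{X}^T])^T$. It is desired
to estimate the value of $\mathbf{Y}$ from observing the value of $\mathbf{X}$ according to the rule

$$\hat{\mathbf{Y}} = \mathbf{A}\mathbf{X}.$$

Show that with

$$\mathbf{A} = \mathbf{A}_o \triangleq \mathbf{K}_{21}\mathbf{K}_1^{-1}$$

the estimator $\hat{\mathbf{Y}}$ is a minimum-variance estimator of $\mathbf{Y}$. By minimum-variance estimator
is meant that the diagonal terms of $E[(\hat{\mathbf{Y}} - \mathbf{Y})(\hat{\mathbf{Y}} - \mathbf{Y})^T]$ are at a minimum.

11.2 Let $\mathbf{X}$ and $\mathbf{Y}$ be real-valued random vectors with $E[\mathbf{X}] = \boldsymbol{\mu}_1$, $E[\mathbf{Y}] = \boldsymbol{\mu}_2$,
$E[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{X} - \boldsymbol{\mu}_1)^T] \triangleq \mathbf{K}_1$, $E[(\mathbf{Y} - \boldsymbol{\mu}_2)(\mathbf{Y} - \boldsymbol{\mu}_2)^T] \triangleq \mathbf{K}_2$, and
$E[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{Y} - \boldsymbol{\mu}_2)^T] \triangleq \mathbf{K}_{12}$. Show that

$$\hat{\mathbf{Y}} = \boldsymbol{\mu}_2 + \mathbf{K}_{21}\mathbf{K}_1^{-1}(\mathbf{X} - \boldsymbol{\mu}_1)$$

is a linear minimum-variance estimator for $\mathbf{Y}$, based on $\mathbf{X}$.

11.3 Use the orthogonality principle to show that the MMSE

$$\varepsilon^2 \triangleq E[(X - E[X \mid Y])^2],$$

for real-valued random variables can be expressed as

$$\varepsilon^2 = E[X(X - E[X \mid Y])]$$

or as

$$\varepsilon^2 = E[X^2] - E[E[X \mid Y]^2].$$

Generalize to the case where $\mathbf{X}$ and $\mathbf{Y}$ are real-valued random vectors, that is, show
that the MMSE matrix is

$$\boldsymbol{\varepsilon}^2 \triangleq E[(\mathbf{X} - E[\mathbf{X} \mid \mathbf{Y}])(\mathbf{X} - E[\mathbf{X} \mid \mathbf{Y}])^T] = E[\mathbf{X}(\mathbf{X} - E[\mathbf{X} \mid \mathbf{Y}])^T] = E[\mathbf{X}\mathbf{X}^T] - E[E[\mathbf{X} \mid \mathbf{Y}]\, E^T[\mathbf{X} \mid \mathbf{Y}]].$$

*11.4 Conclude that the limit in Equation 11.1-17 exists with probability-1 by invoking
the Martingale Convergence Theorem 8.8-4 applied to the random sequence $G[N]$
with parameter $N$ defined in Equation 11.1-18. Specifically show that $\sigma_G^2[N]$ remains
uniformly bounded as $N \to \infty$.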

11.5 Modify Theorem 11.1-3 to specify the LMMSE estimate of the zero-mean random
sequence $X[n]$ based upon the most recent $p$ observations of the zero-mean random
sequence $Y[n]$, that is,

$$\hat{E}\{X[n] \mid Y[n], \ldots, Y[n-p]\} = \sum_{i=0}^{p} a_i^{(p)}\, Y[n-i].$$

(a) Write equations analogous to Equations 11.1-25 and 11.1-26.

(b) Derive the corresponding equation for $\varepsilon^2_{\min}$, as in Equation 11.1-27.


11.6 (Larson & Shubert [11-25]) A Gaussian random sequence $X[n]$, $n = 0, 1, 2, \ldots$ is
defined by the equation

$$X[n] = -\sum_{k=1}^{n} \binom{k+2}{2} X[n-k] + W[n], \qquad n = 1, 2, \ldots,$$

where $X[0] = W[0]$, and $W[n]$, $n = 0, 1, 2, \ldots$ is a Gaussian white noise sequence
with zero mean and variance of unity.

(a) Show that $W[n]$ is the innovations sequence for $X[n]$.

(b) Show that $X[n] = W[n] - 3W[n-1] + 3W[n-2] - W[n-3]$, for $n = 0, 1, 2, \ldots$,
where $W[-3] = W[-2] = W[-1] = 0$.

(c) Use the preceding result to obtain the best two-step predictor of $X[12]$ as
a linear combination of $X[0], \ldots, X[10]$. Also calculate the resulting
mean-square prediction error.


11.7 Let $W[n]$ be a sequence of independent, identically distributed Gaussian random
variables with zero mean and unit variance. Define

$$X[n] \triangleq W[1] + \cdots + W[n], \qquad n = 1, 2, \ldots$$

(a) Find the innovations sequence for $X[n]$.

(b) Let there be noisy observations available:

$$Y[n] = X[n] + V[n], \qquad n = 1, 2, \ldots,$$

where $V[n]$ is also a white Gaussian random sequence with variance $\sigma_V^2$, and
$V[n]$ is orthogonal to $W[n]$.
Find the recursive filtering structure for computing the MMSE estimate

$$\hat{X}[n \mid n] \triangleq E[X[n] \mid Y[1], \ldots, Y[n]].$$

(c) Find the recursive equations specifying any unknown constants in the filter
of (b). Specify the initial conditions.

11.8 A random sequence $Y[n]$, $n = 0, 1, 2, \ldots$, satisfies a second-order linear difference
equation

$$2Y[n+2] + Y[n+1] + Y[n] = 2W[n], \qquad Y[0] = 0,\ Y[1] = 1,$$

with $W[n]$, $n = 0, 1, \ldots$, a standard white Gaussian random sequence (i.e., $N(0, 1)$).
Transform this equation into the state-space representation and evaluate the mean
function $\boldsymbol{\mu}_X[n]$ and the correlation function $\mathbf{R}_{XX}[n_1, n_2]$ at least for the first few
values of $n$.
Hint: Define the state vector $\mathbf{X}[n] \triangleq (Y[n+2], Y[n+1])^T$.
11.9 In our derivation of the Kalman filter in Section 11.2, we assumed that the
Gauss-Markov signal model (Equation 11.2-6) was zero-mean. Here we modify the Kalman
filter to permit the general case of nonzero mean for $\mathbf{X}[n]$. Let the Gauss-Markov
signal model be

$$\mathbf{X}[n] = \mathbf{A}\mathbf{X}[n-1] + \mathbf{B}\mathbf{W}[n], \qquad n \ge 0,$$

where $\mathbf{X}[-1] = \mathbf{0}$ and the centered noise $\mathbf{W}_c[n] \triangleq \mathbf{W}[n] - \boldsymbol{\mu}_W[n]$ is white Gaussian
with variance $\sigma_W^2$, and $\boldsymbol{\mu}_W[n] \neq \mathbf{0}$. The observation equation is still Equation 11.2-7
and $\mathbf{V} \perp \mathbf{W}_c$.

(a) Find an expression for $\boldsymbol{\mu}_X[n]$ and $\boldsymbol{\mu}_Y[n]$.

(b) Show that the MMSE estimate of $\mathbf{X}[n]$ equals the sum of $\boldsymbol{\mu}_X[n]$ and the
MMSE estimate of $\mathbf{X}_c[n] \triangleq \mathbf{X}[n] - \boldsymbol{\mu}_X[n]$ based on the centered observations
$\mathbf{Y}_c[n] \triangleq \mathbf{Y}[n] - \boldsymbol{\mu}_Y[n]$.

(c) Extend the Kalman filtering Equation 11.2-16 to the nonzero mean case by
using the result of (b).

(d) How do the gain and error-covariance equations change?

*11.10 (Larson and Shubert [11-25]) Suppose that the observation equation of the Kalman
predictor is generalized to

$$\mathbf{Y}[n] = \mathbf{C}_n \mathbf{X}[n] + \mathbf{V}[n],$$

where the $\mathbf{C}_n$, $n = 0, 1, 2, \ldots$ are $(M \times N)$ matrices, $\mathbf{X}[n]$ is an $(N \times 1)$ random vector
sequence, and $\mathbf{Y}[n]$ is an $(M \times 1)$ random vector sequence. Let the time-varying signal
model be given as

$$\mathbf{X}[n] = \mathbf{A}_n \mathbf{X}[n-1] + \mathbf{B}_n \mathbf{W}[n].$$

Repeat the derivation of the Kalman predictor to show that the prediction estimate
now becomes

$$\hat{\mathbf{X}}[n] = \mathbf{A}_n\left[(\mathbf{I} - \mathbf{G}_{n-1}\mathbf{C}_{n-1})\hat{\mathbf{X}}[n-1] + \mathbf{G}_{n-1}\mathbf{Y}[n-1]\right],$$

with Kalman gain

$$\mathbf{G}_n = \boldsymbol{\varepsilon}^2[n]\, \mathbf{C}_n^T \left[\mathbf{C}_n \boldsymbol{\varepsilon}^2[n]\, \mathbf{C}_n^T + \boldsymbol{\sigma}_V^2[n]\right]^{-1}.$$

What happens to the equation for the prediction MSE matrix $\boldsymbol{\varepsilon}^2[n]$?
11.11 Here we show how to derive Equation 11.3-5 using property (b) of the $\hat{E}$ operator
of Theorem 11.1-4.

(a) Use property (b) in Theorem 11.1-4 iteratively to conclude

$$\hat{E}[X[n] \mid Y[-N], \ldots, Y[+N]] = \sum_{k=-N}^{+N} g[k]\, Y[k],$$

with $g[k]$ given by Equation 11.3-4.

(b) Use the result of Problem 11.4 to show that $\lim_{N \to \infty} \sum_{k=-N}^{+N} g[k]\, Y[k]$ exists
with probability-1 when $X$ and $Y$ are jointly Gaussian.


11.12 Let $X[n]$, $-\infty < n < +\infty$, be a WSS random sequence with zero mean. Further
assume that $X$ is Gaussian distributed. Let $\hat{R}_N[m]$ be defined as in Equation 11.6-1.

(a) Show that $\hat{R}_N[m]$ is an unbiased estimator of $R_{XX}[m]$, that is, $E\{\hat{R}_N[m]\} =
R_{XX}[m]$, $-\infty < m < +\infty$.

(b) Show that $\hat{R}_N[m] \to R_{XX}[m]$ in mean-square. Hint: Consider $E\{\hat{R}_N[m]\, \hat{R}_N^*[m]\}$,
and use the fourth-order moment property of the Gaussian distribution.

11.13 Let the correlation function

$$R_{XX}[m] = 10 e^{-\lambda_1 |m|} + e^{-\lambda_2 |m|}$$

for $\lambda_1 > 0$ and $\lambda_2 > 0$.

(a) Find $S_{XX}(\omega)$ for $|\omega| \le \pi$.

(b) Evaluate Equation 11.6-4 and show that

$$\lim_{N \to \infty} E\{I_N(\omega)\} = S_{XX}(\omega)$$

in this example.

11.14 Let covariance values for the zero-mean, WSS random sequence $X[n]$ be known for
$m = 0, \pm 1$ and given as $K_{XX}[0] = \sigma_X^2$ and $K_{XX}[\pm 1] = \sigma_X^2 \rho$ with $|\rho| < 1$. Find the
maximum entropy psd for this covariance data and corresponding to $p = 1$.

11.15 Using the MATLAB file ar_ar.m, investigate the effect of increasing data length N 
on the parametric ar(3) spectral density estimate. Plot the resulting psd estimates 
and the true psd to compare for N = 25,100, and 512. Also plot the correlation 
function estimate for N = 100. 

11.16 (Hidden Markov Model) In Example 11.5-5, assume that the observations at n = 
1, 2,3 are, respectively H,H,H. Use the Viterbi algorithm to show that the optimum 
state-sequence is 


1 =h=G =2. 

11.17 Write a MATLAB .m file for computing the optimal state-sequence for model one
in Example 11.5-1 and the parameters given in Example 11.5-5. Let the model
allow for 5 observations, where a 1 represents a Head and a zero represents a Tail.
Thus, given the observation {H,T,T,H,H}, which is represented by {1,0,0,1,1}, the
program should compute the state sequence most likely to have produced it.

11.18 (Expectation-Maximization Algorithm) Assume a tomographic configuration
consisting of two Poisson-emitting cells. Let the emission from cells 1 and 2 in
one second be denoted by $X_1$, $X_2$ respectively. The detector readings are denoted
by $Y_1 = 2X_1 + 3X_2$ and $Y_2 = 3X_1 + 2X_2$. Use the E-M algorithm to find the ML
estimates of $\theta_1$ and $\theta_2$, the Poisson parameters for cells 1 and 2 respectively. (Note
that this problem can be solved without using the E-M algorithm.)

11.19 Solve Equation 11.7-4 for the mean-square interpolation error $\sigma_U^2$ in terms of the
Gaussian model interpolator coefficients $c_k$ and the mean-square prediction error
$\sigma_W^2$. Assume the equation holds for all time $n$ and use psd's in the solution.

11.20 Returning to Equation 11.7-2, consider the deterministic iterative estimate

$$\hat{X}^{(k+1)}[n] \triangleq (c * \hat{X}^{(k)})[n]\, \frac{\sigma_V^2}{\sigma_U^2 + \sigma_V^2} + Y[n]\, \frac{\sigma_U^2}{\sigma_U^2 + \sigma_V^2},$$

starting at $\hat{X}^{(0)}[n] = 0$ for all $n$. Show that under the condition $\sum_k |c_k| < 1$, this
iteration should in the limit achieve the stationary point

$$\hat{X}[n] = (c * \hat{X})[n]\, \frac{\sigma_V^2}{\sigma_U^2 + \sigma_V^2} + Y[n]\, \frac{\sigma_U^2}{\sigma_U^2 + \sigma_V^2},$$

which yields the noncausal Wiener-filter solution for this problem.




REFERENCES 

11-1. K. S. Miller, Complex Stochastic Processes. Reading, MA: Addison-Wesley, 1974,
pp. 76-80.

11-2. J. G. Proakis et al., Advanced Digital Signal Processing. New York: Macmillan, 1992.

11-3. M. S. Grewal and A. P. Andrews, Kalman Filtering. Upper Saddle River, NJ:
Prentice-Hall, 1993.

11-4. R. E. Kalman, “A New Approach to Linear Filtering and Prediction Problems,”
Journal of Basic Engineering, Vol. 82, March 1960, pp. 35-45.

11-5. R. E. Kalman and R. S. Bucy, “New Results in Linear Filtering and Prediction
Theory,” Journal of Basic Engineering, Vol. 83, December 1961, pp. 95-107.

11-6. S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation. Upper Saddle
River, NJ: Prentice-Hall, 1993.

11-7. T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal
Processing. Upper Saddle River, NJ: Prentice-Hall, 2000, Chapter 13.

11-8. A. N. Kolmogorov, “Über die analytischen Methoden in der
Wahrscheinlichkeitsrechnung,” Mathematische Annalen, Vol. 104, 1931, pp. 415-458.

11-9. N. Wiener, The Extrapolation, Interpolation, and Smoothing of Stationary Time
Series with Engineering Applications. New York: Wiley, 1949.

11-10. R. L. Lagendijk et al., “Identification and Restoration of Noisy Blurred Images Using
the Expectation-Maximization Algorithm,” IEEE Transactions on Acoustics, Speech
and Signal Processing, Vol. 38, 1990, pp. 1180-1191.

11-11. T. K. Moon, “The EM Algorithm in Signal Processing,” IEEE Signal Processing
Magazine, Vol. 13, November 1996, pp. 47-60.

11-12. A. K. Katsaggelos, ed., Digital Image Restoration. New York: Springer-Verlag, 1989,
Chapter 6.

11-13. See Chapter 17 in [9-11].

11-14. L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition,” Proceedings of the IEEE, Vol. 77, February 1989, pp. 257-286.

11-15. L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Upper Saddle
River, NJ: Prentice-Hall, 1993.

11-16. G. A. Frantz and R. H. Wiggins, “Design Case History: Speak and Spell Learns to
Talk,” IEEE Spectrum, February 1982, pp. 45-49.

11-17. G. D. Forney Jr., “The Viterbi Algorithm,” Proceedings of the IEEE, Vol. 61, No. 3,
March 1973, pp. 268-278.

11-18. A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood
Cliffs, NJ: Prentice-Hall, 1989, Chapter 3.

11-19. G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control. San
Francisco: Holden-Day, 1978.

11-20. H. Derin and P. A. Kelly, “Discrete-Index Markov-Type Random Processes,”
Proceedings of the IEEE, Vol. 77, October 1989, pp. 1485-1510.

11-21. S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-6, November 1984, pp. 721-741.

11-22. F.-C. Jeng and J. W. Woods, “Simulated Annealing in Compound Gaussian Random
Fields,” IEEE Transactions on Information Theory, Vol. IT-36, January 1990,
pp. 94-107.

11-23. J. Besag, “Spatial Interaction and the Statistical Analysis of Lattice Systems,”
Journal of the Royal Statistical Society, Series B, Vol. 34, 1974, pp. 192-236.

11-24. F.-C. Jeng and J. W. Woods, “Compound Gauss-Markov Random Fields for Image
Estimation,” IEEE Transactions on Signal Processing, Vol. 39, March 1991.

11-25. H. J. Larson and B. O. Shubert, Probabilistic Models in Engineering Sciences, Vol. I.
New York: John Wiley, 1979.


APPENDIX A Review of Relevant Mathematics


This section will review the mathematics needed for the study of probability and random 
processes. We start with a review of basic discrete and continuous mathematical concepts. 


A.1 BASIC MATHEMATICS 


We review the concept of sequence and present several examples. We then look at summation 
of sequences. Next the Z-transform is reviewed. 


Sequences 


A sequence is simply a mapping of a set of integers into the set of real or complex numbers.
Most often the set of integers is the nonnegative integers $\{n \ge 0\}$ or the set of all integers
$\{-\infty < n < +\infty\}$.

An example of a sequence often encountered is the exponential sequence $a^n$ for $\{n \ge 0\}$,
which is plotted in Figure A.1-1 for several values of the real number $a$. Note that for $|a| > 1$,
the sequence diverges, while for $|a| < 1$, the sequence converges to 0. For $a = 1$, the sequence
is the constant 1, and for $a = -1$, the sequence alternates between $+1$ and $-1$.

A related and important sequence is the complex exponential $\exp(j\omega n)$. These sequences
are eigenfunctions of linear time-invariant systems, which just means that for such a
system with frequency response function $H(\omega)$, the response to the input $\exp(j\omega n)$ is just
$H(\omega) \exp(j\omega n)$, a scaled version of the input.

Figure A.1-1 Plot of exponential sequence for three values of a = 1.05, 1.0, and 0.8.


Convergence 


A sequence, denoted $x[n]$ or $x_n$, which is defined on the positive integers $n \ge 1$, converges
to a limiting value $x$ if the values $x[n]$ become nearer and nearer to $x$ as $n$ becomes large.
More precisely, we can say that for any given $\varepsilon > 0$, there must exist a value $N_0(\varepsilon)$
such that for all $n > N_0$, we have $|x[n] - x| < \varepsilon$. Note that $N_0$ is allowed to depend
on $\varepsilon$.


Example A.1-1 
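Let the sequence $a_n$ be given as

$$a_n = \frac{2^n}{2^n + 3^n},$$

and find the limit as $n \to \infty$. From observation, we see that $\lim_{n \to \infty} a_n = 0$. To complete
the argument, we can then express $N_0(\varepsilon)$ from the equation

$$\frac{2^n}{2^n + 3^n} < \varepsilon$$

as

$$N_0(\varepsilon) = \frac{\ln\left((1 - \varepsilon)/\varepsilon\right)}{\ln(3/2)},$$

where we assume that $0 < \varepsilon < 1$. We note that for any fixed $0 < \varepsilon < 1$, the value $N_0$ is
finite as required.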
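The threshold in this example can be checked numerically. The sketch below solves the inequality $2^n/(2^n + 3^n) < \varepsilon$ for $n$, which is equivalent to $(3/2)^n > 1/\varepsilon - 1$, and then confirms the first integer index at which the bound holds. The function names `a` and `n0` and the chosen tolerance are illustrative.

```python
import math

def a(n):
    """The sequence a_n = 2^n / (2^n + 3^n)."""
    return 2.0 ** n / (2.0 ** n + 3.0 ** n)

def n0(eps):
    """Threshold from 2^n / (2^n + 3^n) < eps, i.e. (3/2)^n > 1/eps - 1."""
    return math.log((1.0 - eps) / eps) / math.log(1.5)

eps = 1e-3                                  # illustrative tolerance
n_first = math.floor(n0(eps)) + 1           # first integer n with n > N0(eps)
print(n_first, a(n_first), a(n_first - 1))
```

Since the sequence is decreasing in $n$, every index beyond `n_first` also satisfies the bound.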
Summations


Summations of sequences arise quite often in our work. A common sequence used to illustrate
summation concepts is the geometric sequence $a^n$. The following summation formula can
be readily derived. Take $n_2 \ge n_1$; then

$$\sum_{n=n_1}^{n_2} a^n = \frac{a^{n_1} - a^{n_2+1}}{1 - a} \quad \text{for } a \neq 1. \tag{A.1-1}$$

Of course, when $a = 1$, the summation is just $n_2 - n_1 + 1$. A simple way to see the validity
of Equation A.1-1 is to first define $S \triangleq \sum_{n=n_1}^{n_2} a^n$ and then note that, by the special property
of the geometric sequence,

$$aS = S + a^{n_2+1} - a^{n_1}.$$

Then, by solving for $S$, we derive Equation A.1-1 when $a \neq 1$.
When $|a| < 1$, the upper limit of summation can be extended to $\infty$ to yield

$$\sum_{n=n_1}^{\infty} a^n = \frac{a^{n_1}}{1 - a} \quad \text{for } |a| < 1. \tag{A.1-2}$$

Another useful related summation is

$$\sum_{n=n_1}^{\infty} n a^n = \frac{n_1 a^{n_1}(1 - a) + a^{n_1+1}}{(1 - a)^2} \quad \text{for } |a| < 1. \tag{A.1-3}$$

Equations A.1-2 and A.1-3 most often occur with $n_1 = 0$.
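The three closed forms above are easy to sanity-check against brute-force partial sums. The function names in this sketch are illustrative, and the infinite sums are approximated in the check by truncating at a large upper index.

```python
def geo_sum(a, n1, n2):
    """Finite geometric sum of Equation A.1-1 (requires a != 1)."""
    return (a ** n1 - a ** (n2 + 1)) / (1 - a)

def geo_tail(a, n1):
    """Infinite geometric sum of Equation A.1-2 (requires |a| < 1)."""
    return a ** n1 / (1 - a)

def weighted_tail(a, n1):
    """Infinite sum of n * a**n from Equation A.1-3 (requires |a| < 1)."""
    return (n1 * a ** n1 * (1 - a) + a ** (n1 + 1)) / (1 - a) ** 2

a, n1, n2 = 0.5, 3, 7
print(geo_sum(a, n1, n2), sum(a ** n for n in range(n1, n2 + 1)))
print(geo_tail(a, n1), sum(a ** n for n in range(n1, 200)))
print(weighted_tail(a, n1), sum(n * a ** n for n in range(n1, 200)))
```

Each printed pair should agree to within floating-point rounding.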


Z-Transform 


This transform is very helpful in solving for various quantities in a linear time-invariant
system and also for the solution of linear constant-coefficient difference equations. The
Z-transform is defined for a deterministic sequence $x[n]$ as follows:

$$X(z) \triangleq \sum_{n=-\infty}^{+\infty} x[n] z^{-n}, \quad \text{for } z \in \mathcal{R}.$$

In this equation, the region $\mathcal{R}$ is called the region of convergence and denotes the set of
complex numbers $z$ for which the transform is defined. This set $\mathcal{R}$ is further specified as
those $z$ for which the relevant sum converges absolutely, that is,

$$\sum_{n=-\infty}^{+\infty} |x[n]|\, |z|^{-n} < \infty.$$

This region $\mathcal{R}$ can be written in general as $\mathcal{R} = \{z : R_- < |z| < R_+\}$, an annular-shaped
region. The set $\{z : R_- < |z| < R_+\}$ is to be read as “the set of all points $z$ whose magnitude
(length) is greater than $R_-$ and less than $R_+$.”




Example A.1-2 
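Let the discrete-time sequence $x[n]$ be given as the exponential

$$x[n] = a^n \exp(j\omega_0 n)\, u[n],$$

where $u[n]$ denotes the unit step sequence, $u[n] = 1$ for $n \ge 0$ and $u[n] = 0$ for $n < 0$.
Calculating the Z-transform, we get

$$\begin{aligned}
X(z) &= \sum_{n=0}^{\infty} a^n \exp(j\omega_0 n)\, z^{-n} \\
&= \sum_{n=0}^{\infty} \left(a e^{j\omega_0} z^{-1}\right)^n \\
&= \frac{1}{1 - a e^{j\omega_0} z^{-1}}, \quad |z| > |a|.
\end{aligned} \tag{A.1-4}$$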
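As a quick numerical check of this example, the sketch below compares a long partial sum of the defining series with the closed form of a geometric series, at a point $z$ with $|z| > |a|$. The function names and the test values of $a$, $\omega_0$, and $z$ are illustrative assumptions.

```python
import cmath

def z_partial(a, w0, z, terms):
    """Partial sum of sum_n (a e^{j w0} / z)^n over n = 0 .. terms-1."""
    ratio = a * cmath.exp(1j * w0) / z
    return sum(ratio ** n for n in range(terms))

def z_closed(a, w0, z):
    """Closed form 1 / (1 - a e^{j w0} z^{-1}), valid for |z| > |a|."""
    return 1.0 / (1.0 - a * cmath.exp(1j * w0) / z)

z = 1.2 * cmath.exp(0.3j)        # test point outside the circle |z| = |a|
print(z_partial(0.8, 0.5, z, 200), z_closed(0.8, 0.5, z))
```

Inside the circle $|z| = |a|$ the partial sums grow without bound, which is why the region of convergence is part of the transform's definition.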


The Z-transform is quite useful in discrete-time signal processing because of the 
following fundamental theorem relating convolution and multiplication of the corresponding 
Z-transforms. 


Theorem A.1-1 Consider the convolution of two absolutely summable sequences
$x[n]$ and $h[n]$, which generates a new sequence $y[n]$ as follows:

$$y[n] = \sum_{m=-\infty}^{+\infty} h[m]\,x[n-m],$$

which we denote operationally as $y = h * x$. Then the Z-transform of $y[n]$ is given in terms
of the corresponding Z-transforms of $x$ and $h$ as

$$Y(z) = H(z)X(z) \quad \text{for } z \in \mathcal{R}_h \cap \mathcal{R}_x.$$

Because the two sequences $h$ and $x$ are absolutely summable, their regions of convergence
$\mathcal{R}_h$ and $\mathcal{R}_x$ will both include the unit circle of the $z$-plane, that is, $\{|z| = 1\}$.
Then the Z-transform $Y(z)$ will exist for $z \in \mathcal{R}_h \cap \mathcal{R}_x$, which is then guaranteed to be
nonempty. ■
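Theorem A.1-1 can be illustrated for finite-length causal sequences, where every sum is finite and the region of convergence is the whole plane except $z = 0$. The sketch below is a minimal check under those assumptions; the sequences `h` and `x`, the helper names, and the evaluation point on the unit circle are all illustrative choices.

```python
def convolve(h, x):
    """Direct convolution y[n] = sum_m h[m] * x[n - m] of two finite causal sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for m, hm in enumerate(h):
        for k, xk in enumerate(x):
            y[m + k] += hm * xk
    return y


def z_transform(seq, z):
    """Evaluate sum_n seq[n] * z**(-n) for a finite causal sequence starting at n = 0."""
    return sum(v * z ** (-n) for n, v in enumerate(seq))


# Illustrative sequences (assumptions, not from the text).
h = [1.0, 0.5, 0.25]
x = [2.0, -1.0, 3.0, 0.5]
y = convolve(h, x)

z = complex(0.8, 0.6)  # a point on the unit circle, since 0.8**2 + 0.6**2 == 1
lhs = z_transform(y, z)
rhs = z_transform(h, z) * z_transform(x, z)
print(abs(lhs - rhs))  # difference between Y(z) and H(z)X(z)
```

Any other nonzero `z` works for these finite sequences; the unit circle is used only because it is the point set the theorem guarantees for absolutely summable sequences.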


After obtaining the Z-transform of a convolution using this result, one can often take 
the inverse Z-transform to get back the output sequence y[n]. There are several ways to do 
this, including expansion of the Z-transform Y(z) in a power series, doing long division in 
the typical case when Y(z) is a ratio of polynomials in z, and the most powerful method, the 
method of residues. This last method, along with the residue method for inverse Laplace 
transforms, is the topic of Section A.3 of this appendix. 


A.2 CONTINUOUS MATHEMATICS


The intent here is to review some ideas from the integral calculus of one- and two-dimensional 
functions of real variables. 

Definite and Indefinite Integrals


In a basic calculus course, we study two types of integrals, definite and indefinite:

$$\int x^2\,dx = \frac{x^3}{3} + C \quad \text{(indefinite)},$$

$$\int_a^b x^2\,dx = \frac{1}{3}b^3 - \frac{1}{3}a^3 \quad \text{(definite)}.$$

In this course we will almost always write the definite integral, almost never the indefinite
integral. This is because we will use integrals to measure specific quantities, not merely to
determine the class of functions that have a given derivative. Please note the difference
between these two integrals. Unlike the indefinite integral, the definite integral is a function
of its upper and lower limits, but not of $x$ itself! Sometimes we refer to $x$ in our definite
integrals as a “dummy variable” for this reason, that is, $x$ could just as well be replaced by
another variable, say $y$, with no change resulting to our definite integral, that is,

$$\int_a^b x^2\,dx = \int_a^b y^2\,dy.$$

To compute the definite integral we first compute the indefinite integral and then subtract
its evaluation at the lower limit from its evaluation at the upper limit.

In elementary calculus courses it is often not stressed that definite integrals are operations
on sets and that there are integrals other than the “area under a curve” type, that is, other
than the so-called Riemann integrals. Consider the definite integral

$$I = \int_a^b f(x)\,dz(x).$$

Here the set of points is $\{x : a < x < b\}$ and the integral is computed by assigning numerical
values to the points in an $n$-partition of the interval $(a,b)$, with cell width $\Delta x = (b-a)/n$,
for example

$$I_n(a,b) = \sum_i f(i\Delta x)\,\bigl[z(i\Delta x + \Delta x/2) - z(i\Delta x - \Delta x/2)\bigr] \to I \quad \text{as } n \to \infty,$$

where $i\Delta x,\ i\Delta x \pm \Delta x/2 \in \{x : a < x < b\}$. If $z(x) = x$, then $I$ becomes the well-known
“area under the curve” Riemann integral. But in some cases the Riemann integral won’t
suffice. For example, consider the expectation operation we encountered in Chapter 4, that
is, $E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx$, which will converge to the desired result if $f_X(x)$ is well-behaved,
that is, a bounded function with only a finite set of discontinuities, etc. But if $f_X(x)$ does
not fall into this category of functions, we can still compute $E[X]$ from the integral
$E[X] = \int_{-\infty}^{\infty} x\,dF_X(x)$, where $F_X(x)$ is the CDF of $X$. This type of integral is called a Stieltjes integral
and is a generalization of the “area under the curve” type integral that is taught in beginning
calculus courses. For example, if $F_X(x) = (1 - e^{-\lambda x})u(x)$, then $dF_X(x) = \lambda e^{-\lambda x}u(x)\,dx$
and $E[X] = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx = 1/\lambda$.
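A partition sum of this kind is easy to evaluate numerically. The following sketch assumes cell centers at $a + (i + 1/2)\Delta x$ (a small variation on the indexing above) and uses hypothetical helper names. It computes $E[X]$ for the exponential CDF with $\lambda = 2$, and also for a CDF with jumps, where no pdf in the ordinary sense exists; the two-point distribution is an illustrative assumption.

```python
import math


def stieltjes_sum(f, z, a, b, n):
    """Approximate the integral of f(x) dz(x) over (a, b) using n cells of width dx.

    Each cell contributes f at the cell center times the increment of z across the cell.
    """
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        total += f(x) * (z(x + dx / 2) - z(x - dx / 2))
    return total


def exponential_cdf(x, lam=2.0):
    """F_X(x) = (1 - exp(-lam * x)) u(x)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


def two_point_cdf(x):
    """CDF of a discrete RV with P[X = 1] = 0.25 and P[X = 3] = 0.75 (illustrative)."""
    return (0.25 if x >= 1 else 0.0) + (0.75 if x >= 3 else 0.0)


# Upper limit 40 stands in for infinity; the neglected tail is of order exp(-80).
mean_exponential = stieltjes_sum(lambda x: x, exponential_cdf, 0.0, 40.0, 200_000)
# An odd cell count keeps the jump points away from the cell boundaries.
mean_two_point = stieltjes_sum(lambda x: x, two_point_cdf, 0.0, 4.0, 3001)

print(mean_exponential)  # close to 1/lam = 0.5
print(mean_two_point)    # close to 0.25*1 + 0.75*3 = 2.5
```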

Differentiation of Integrals


From time to time, it becomes necessary to differentiate an integral with respect to a
parameter which appears either in the upper limit, the lower limit, or the integrand itself:

$$\frac{d}{dy}\int_{a(y)}^{b(y)} f(x,y)\,dx = f(b(y),y)\frac{db(y)}{dy} - f(a(y),y)\frac{da(y)}{dy} + \int_{a(y)}^{b(y)} \frac{\partial f(x,y)}{\partial y}\,dx. \tag{A.2-1}$$

This important formula is derived by recalling that for a function $I(b,a,y)$ where, in turn,
$b = b(y)$ and $a = a(y)$ are two functions of $y$, we have

$$\frac{dI}{dy} = \frac{\partial I}{\partial b}\frac{db}{dy} + \frac{\partial I}{\partial a}\frac{da}{dy} + \frac{\partial I}{\partial y}.$$

If we denote

$$I \triangleq \int_{a(y)}^{b(y)} f(x,y)\,dx$$

and define a function $F(x,y)$ such that

$$f(x,y) \triangleq \frac{\partial F(x,y)}{\partial x},$$

then $I = F(b,y) - F(a,y)$, so that $\partial I/\partial b = f(b,y)$ and $\partial I/\partial a = -f(a,y)$, and clearly

$$\frac{\partial I}{\partial y} = \frac{\partial}{\partial y}\int_{a(y)}^{b(y)} f(x,y)\,dx = \int_{a(y)}^{b(y)} \frac{\partial f(x,y)}{\partial y}\,dx.$$

The last step on the right follows from treating $b(y)$ and $a(y)$ as constants, since the variation
of $I$ arising from its upper and lower limits is already counted by the first two terms.

An example of use of this formula, which arises in the study of how systems transform 
probability functions, is shown next. 


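Example A.2-1
Consider the example where the function $f(x,y) = (x + 2y)^2$, with $a(y) = 0$ and $b(y) = y$:

$$\frac{d}{dy}\int_0^y (x+2y)^2\,dx = (y+2y)^2 \cdot 1 - (0+2y)^2 \cdot 0 + \int_0^y 4(x+2y)\,dx$$

$$= (3y)^2 + 4\left.\left(\tfrac{1}{2}x^2 + 2yx\right)\right|_0^y$$

$$= (3y)^2 + 2y^2 + 8y^2 = 19y^2.$$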
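The result $19y^2$ can be cross-checked numerically by differentiating the integral directly. The sketch below is illustrative only: the midpoint-rule integrator, the central-difference step, and the test point $y = 1.5$ are assumptions made for the check.

```python
def integrate(g, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dx) for i in range(n)) * dx


def derivative_of_integral(y, eps=1e-4):
    """Central-difference derivative of I(y) = int_0^y (x + 2y)^2 dx."""
    def I(t):
        return integrate(lambda x: (x + 2 * t) ** 2, 0.0, t)
    return (I(y + eps) - I(y - eps)) / (2 * eps)


def leibniz_right_side(y):
    """Right side of Equation A.2-1 with a(y) = 0, b(y) = y, f(x, y) = (x + 2y)^2."""
    boundary = (y + 2 * y) ** 2 * 1.0 - (0 + 2 * y) ** 2 * 0.0
    interior = integrate(lambda x: 4 * (x + 2 * y), 0.0, y)  # integral of df/dy
    return boundary + interior


y0 = 1.5  # illustrative test point
print(derivative_of_integral(y0), leibniz_right_side(y0), 19 * y0**2)
```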

Integration by Parts


Integration by parts is a useful technique for explicit calculation of integrals. We write the
formula as follows:

$$\int_a^b u(x)\,dv(x) = u(x)v(x)\Big|_a^b - \int_a^b v(x)\,du(x), \tag{A.2-2}$$

where $u$ and $v$ denote functions of the variable $x$ with the integral extending over the range
$a \le x \le b$. This formula is derived using the product rule for derivatives, applied to the
product function $u(x)v(x)$. An example is shown below. Integration by
parts is useful to extend the class of integrals that are doable analytically.


Example A.2-2
Consider the following integration problem:

$$\int_0^{\infty} x e^{-2x}\,dx.$$

Let $u(x) = x$ and $dv(x) = e^{-2x}dx$; then using the above integration by parts formula we
obtain

$$\int_0^{\infty} x e^{-2x}\,dx = x\left(-\tfrac{1}{2}e^{-2x}\right)\Big|_0^{\infty} - \int_0^{\infty}\left(-\tfrac{1}{2}e^{-2x}\right)dx = 0 + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.$$



Completing the Square 


The method of completing the square is applied to the calculation of integrals by trans- 
forming an unknown integral into a known one by turning the argument of its integrand 
into a perfect square. For example, consider making a perfect square out of $x^2 + 4x$. We can
transform it into the perfect square $(x+2)^2$ by adding and subtracting 4, that is,

$$x^2 + 4x = (x+2)^2 - 4.$$

To see how this polynomial concept can be used to calculate integrals, consider the well-known
Gaussian integral that we often encounter in this course:

$$\int_{-\infty}^{+\infty} e^{-\frac{1}{2}x^2}\,dx = \sqrt{2\pi}.$$

If, instead, we need to calculate

$$\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x^2+4x)}\,dx,$$

we can do so by completing the square as follows:

$$\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x^2+4x)}\,dx = e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x^2+4x+4)}\,dx,$$

where we have multiplied by $e^{-2}$ inside the integral and by $e^{2}$ outside. Then we continue,

$$= e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x+2)^2}\,dx.$$

With the change of variables $y = x + 2$, this then becomes

$$= e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}y^2}\,dy = e^{2}\sqrt{2\pi}.$$
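As a sanity check on the completed square, the integral can be evaluated numerically and compared with $e^{2}\sqrt{2\pi}$. This is a rough sketch: the midpoint rule and the truncation of the infinite range to $[-40, 40]$ are assumptions made for the check.

```python
import math


def integrate(g, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dx) for i in range(n)) * dx


# The integrand is negligible outside [-40, 40], so the truncation error is tiny.
numeric = integrate(lambda x: math.exp(-0.5 * (x * x + 4.0 * x)), -40.0, 40.0)
closed_form = math.exp(2.0) * math.sqrt(2.0 * math.pi)

print(numeric, closed_form)
```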


Double Integration 


Integrals on the $(x, y)$ plane are properly called double integrals. The infinitesimal element
is an area, written as $dx\,dy$. We often evaluate these integrals in some order, say $x$ first and
then $y$, or vice versa. Then the integral is called an iterated integral. We can write the three
possible situations as follows:

$$\int_{y_1}^{y_2}\left(\int_{x_1}^{x_2} f(x,y)\,dx\right)dy, \qquad \int_{x_1}^{x_2}\!\!\int_{y_1}^{y_2} f(x,y)\,dx\,dy, \qquad \int_{x_1}^{x_2}\left(\int_{y_1}^{y_2} f(x,y)\,dy\right)dx,$$

where the integral in the middle is the true double or area integral. Since limiting operations
are the basis for any integral, there is actually a question of whether the three
two-dimensional integrals are always equal. Fortunately, an advanced result in measure
theory [9-1] shows that when the integrals are defined in the modern Lebesgue sense, then
all three either exist and are equal, or do not exist. We will consider only the ordinarily
occurring case where the above three integrals exist and are equal.

Note that on the left, where we integrate on $x$ first, the limits are interchanged
relative to the situation on the right, where we integrate in the $y$ direction first. The double
or area integral in the middle adopts the notation that one reads the limits in $x, y$ order,
just as in the function arguments and the area differential $dx\,dy$. Thus, there should be no
confusion in interpreting such expressions as

$$\int_1^3\!\!\int_0^5 x e^{-y}\,dx\,dy,$$

since we would read this correctly as an integral over the rectangle with opposite corners
$(x,y) = (1,0)$ and $(x,y) = (3,5)$.


Functions 


A function is a unique mapping from a domain space $\mathcal{X}$ to a range space $\mathcal{Y}$. The only
condition is uniqueness, which means that only one $y$ goes with each $x$, that is, $f(x)$ has one
and only one value. An example is $f(x) = x^2$. A counterexample is $f(x) = \pm\sqrt{x}$.

Monotone Functions. A monotone function of the real variable $x$ is one that always
increases as x increases or always decreases as x increases. The former, with the positive 
slope, is called monotone increasing, while the latter, with the negative slope, is called 
monotone decreasing, as illustrated in Figures A.2-1 and A.2-2. If a function is monotone 
except for some flat regions of zero slope, then we use the terms monotone nondecreasing 
or monotone nonincreasing to describe them, as illustrated in Figure A.2-3. 


Inverse Functions. A function may or may not have an inverse. The inverse function exists
when the original function has the additional uniqueness property that to each $y$ in $\mathcal{Y}$, there
corresponds only one $x$ (in $\mathcal{X}$). This allows us to define an inverse function $f^{-1}(y)$ to map
back from $\mathcal{Y}$ to $\mathcal{X}$. We note that a sufficient condition for the inverse function to exist is
that the original function $f(x)$ is monotone increasing or monotone decreasing. The function
sketched in Figure A.2-3 does not have an inverse due to the flat section of zero slope.

Figure A.2-1 Example of a monotone increasing function.

Figure A.2-2 Example of a monotone decreasing function.

Figure A.2-3 Example of a monotone nonincreasing function.


A.3 RESIDUE METHOD FOR INVERSE FOURIER TRANSFORMATION 


In Chapters 8 and 9, we defined the power spectral density (psd) S(w) for both discrete and 
continuous time and showed that the psd is central to analyzing LSI systems with random 
sequence and process inputs. We often want to take an inverse transform to find the corre- 
lation function corresponding to a given psd to obtain a time-domain characterization. This 
section summarizes the powerful residue method for accomplishing the necessary inverse 
Fourier transformation. 

We start by recalling the relation between the psd and correlation function for a WSS
random process,

$$S(\omega) = \int_{-\infty}^{+\infty} R(\tau)e^{-j\omega\tau}\,d\tau,$$

$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S(\omega)e^{+j\omega\tau}\,d\omega.$$

To apply the residue method of complex variable theory [A-3] to the evaluation of the above
IFT, we must first express this integral as an integral along a contour in the complex $s$-plane.
We define a new function $\mathsf{S}$ of the complex variable $s = \sigma + j\omega$ as follows.

First we define $\mathsf{S}(s)$ on the imaginary axis in terms of the function of a real variable
$S(\omega)$ as

$$\mathsf{S}(s)|_{s=j\omega} \triangleq S(\omega).$$

Then we replace $j\omega$ by $s$ to extend the function $\mathsf{S}(j\omega)$ to the entire complex plane. Thus,

$$\mathsf{S}(s)|_{s=j\omega} = S(\omega) = \int_{-\infty}^{+\infty} R(\tau)e^{-j\omega\tau}\,d\tau,$$

so that

$$\mathsf{S}(s) = \int_{-\infty}^{+\infty} R(\tau)e^{-s\tau}\,d\tau, \tag{A.3-1}$$

which is the two-sided Laplace transform of the correlation function $R$. Also by inverse
Fourier transform,

$$R(\tau) = \frac{1}{2\pi j}\int_{-\infty}^{+\infty} \mathsf{S}(j\omega)e^{j\omega\tau}\,d(j\omega) = \frac{1}{2\pi j}\int_{-j\infty}^{+j\infty} \mathsf{S}(s)e^{s\tau}\,ds, \tag{A.3-2}$$

which is an integral along the imaginary axis of the $s$-plane.


†This material assumes that the reader is familiar with the discussions in Chapters 8 and 9.

The integral in Equation A.3-2 is called a contour integral in the theory of functions
of a complex variable [A-2] [A-3], where it is shown that one can evaluate such an integral 
over a closed contour by the method of residues. This method is particularly easy to apply 
when the functions are rational; that is, the function is the ratio of two polynomials in s. 
Since this situation often occurs in linear systems whose behavior is modeled by differential 
equations, this method of evaluation can be very useful. We state the main result as a fact 
from the theory of complex variables. 


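Fact


Let $F(s)$ be a function of the complex variable $s$, which is analytic inside and on a closed
counterclockwise contour $C$ except at $P$ poles located inside $C$. The contour $C$ encircles the
origin. The $P$ poles are located at $s = p_i$, $i = 1, \ldots, P$. Then

$$\frac{1}{2\pi j}\oint_C F(s)\,ds = \sum_{i=1}^{P} \operatorname{Res}[F(s); s = p_i], \tag{A.3-3}$$

where

1. at a first-order pole, $\operatorname{Res}[F(s); s = p] = [F(s)(s-p)]\big|_{s=p}$;

2. at a second-order pole, $\operatorname{Res}[F(s); s = p] = \frac{d}{ds}[F(s)(s-p)^2]\big|_{s=p}$; and

3. at an $n$th-order pole, $\operatorname{Res}[F(s); s = p] = \frac{1}{(n-1)!}\left.\left(\frac{d^{n-1}}{ds^{n-1}}[F(s)(s-p)^n]\right)\right|_{s=p}$.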
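Equation A.3-3 can be tried out numerically on a simple test function. In the sketch below, $F(s) = e^{s}/[(s - 0.5)(s + 0.25)^2]$ is a hypothetical example (not from the text) with a first-order pole at $s = 0.5$ and a second-order pole at $s = -0.25$, both inside the unit circle; the contour integral is approximated by a uniform sum over the circle and compared with the residues from rules 1 and 2.

```python
import cmath
import math


def F(s):
    """Illustrative test function: first-order pole at 0.5, second-order pole at -0.25."""
    return cmath.exp(s) / ((s - 0.5) * (s + 0.25) ** 2)


def contour_integral_over_2pij(func, radius=1.0, n=4096):
    """(1 / (2*pi*j)) times the counterclockwise integral of func around |s| = radius.

    Uses the parametrization s = radius * exp(j*theta), so ds = j * s * dtheta.
    """
    total = 0j
    dtheta = 2.0 * math.pi / n
    for k in range(n):
        s = radius * cmath.exp(1j * k * dtheta)
        total += func(s) * (1j * s * dtheta)
    return total / (2j * math.pi)


# Rule 1 (first-order pole at p = 0.5): [F(s)(s - p)] evaluated at s = p.
res_first = math.exp(0.5) / (0.5 + 0.25) ** 2

# Rule 2 (second-order pole at p = -0.25): d/ds [F(s)(s - p)^2] = d/ds [e^s / (s - 0.5)],
# which equals e^s * ((s - 0.5) - 1) / (s - 0.5)^2, evaluated at s = p.
p = -0.25
res_second = math.exp(p) * ((p - 0.5) - 1.0) / (p - 0.5) ** 2

integral = contour_integral_over_2pij(F)
print(integral, res_first + res_second)
```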


In applying these results to our problem we first have to close the contour in some fashion.
If we close the contour with a half-circle of infinite radius $C_L$, as shown in Figure A.3-1, then,
provided that the function being integrated, $\mathsf{S}(s)e^{s\tau}$, tends to zero fast enough as $|s| \to +\infty$,
the value of the integral will not be changed by this closing of the contour. In other words,
the integral over the semicircular part of the contour will be zero. The conditions for this
are that $|\mathsf{S}(s)|$ stays bounded as $|s| \to +\infty$, and

$$|e^{s\tau}| \to 0 \quad \text{as } \operatorname{Re}(s) \to -\infty,$$

the latter of which is satisfied for all $\tau > 0$. Thus, for positive $\tau$ we have

$$R(\tau) = \frac{1}{2\pi j}\oint_{C_L} \mathsf{S}(s)e^{s\tau}\,ds = \sum_{p_i \text{ inside } C_L} \operatorname{Res}[\mathsf{S}(s)e^{s\tau}; s = p_i].$$

Figure A.3-1 Closed contour in left half of $s$-plane for $\tau > 0$.

Similarly, for $\tau < 0$ one can show that it is permissible to close the contour to the right as
shown in Figure A.3-2, in which case we have

$$|e^{s\tau}| \to 0 \quad \text{as } \operatorname{Re}(s) \to +\infty,$$

so that we get

$$R(\tau) = \frac{1}{2\pi j}\oint_{C_R} \mathsf{S}(s)e^{s\tau}\,ds = -\sum_{p_i \text{ inside } C_R} \operatorname{Res}[\mathsf{S}(s)e^{s\tau}; s = p_i]$$

for $\tau < 0$, the minus sign arising from the clockwise traversal of the contour.


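Figure A.3-2 Closed contour in right half of $s$-plane for $\tau < 0$.


Example A.3-1
(first-order psd) Let

$$S(\omega) = 2\alpha/(\alpha^2 + \omega^2), \quad 0 < \alpha < 1.$$

Then

$$\mathsf{S}(s)|_{s=j\omega} = S(\omega) = 2\alpha/(\alpha^2 + \omega^2) = 2\alpha/[(j\omega + \alpha)(-j\omega + \alpha)],$$

so

$$\mathsf{S}(s) = \frac{2\alpha}{(s+\alpha)(-s+\alpha)},$$

where the configuration of the poles in the $s$-plane is shown in Figure A.3-3.

Figure A.3-3 Pole-zero diagram.

Evaluating the residues for $\tau > 0$, we get

$$R(\tau) = \operatorname{Res}[\mathsf{S}(s)e^{s\tau}; s = -\alpha] = \left.\frac{2\alpha e^{s\tau}(s+\alpha)}{(s+\alpha)(-s+\alpha)}\right|_{s=-\alpha} = e^{-\alpha\tau},$$

while for $\tau < 0$ we get

$$R(\tau) = -\operatorname{Res}[\mathsf{S}(s)e^{s\tau}; s = +\alpha] = \left.\frac{-2\alpha e^{s\tau}(s-\alpha)}{(s+\alpha)(-s+\alpha)}\right|_{s=+\alpha} = \left.\frac{-2\alpha e^{s\tau}}{(s+\alpha)(-1)}\right|_{s=\alpha} = e^{\alpha\tau}.$$

Combining the results into a single formula, we get

$$R(\tau) = \exp(-\alpha|\tau|), \quad -\infty < \tau < +\infty.$$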
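The residue result can be compared with a brute-force numerical inverse Fourier transform. The sketch below is only a rough check: the value $\alpha = 0.5$, the truncation frequency, and the grid size are assumptions, and because the psd decays like $1/\omega^2$ the truncated integral is accurate only to roughly $2\alpha/(\pi\,\omega_{\max})$ at $\tau = 0$.

```python
import math


def psd(w, alpha):
    """First-order psd S(w) = 2*alpha / (alpha^2 + w^2)."""
    return 2.0 * alpha / (alpha**2 + w**2)


def correlation_numeric(tau, alpha, w_max=2000.0, n=400_000):
    """Approximate R(tau) = (1/2pi) * integral of S(w) exp(j w tau) dw.

    Since S(w) is even, this equals (1/pi) * integral_0^inf S(w) cos(w tau) dw,
    evaluated here by the midpoint rule on [0, w_max].
    """
    dw = w_max / n
    total = 0.0
    for i in range(n):
        w = (i + 0.5) * dw
        total += psd(w, alpha) * math.cos(w * tau)
    return total * dw / math.pi


alpha = 0.5  # illustrative value in (0, 1)
results = {tau: correlation_numeric(tau, alpha) for tau in (-2.0, 0.0, 1.0)}
for tau, value in results.items():
    print(tau, value, math.exp(-alpha * abs(tau)))
```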


Inverse Fourier Transform for psd of Random Sequence 


In the case of a random sequence one can do a similar contour integral evaluation in the
complex $z$-plane. We recall the transform and inverse transform for a sequence:

$$S(\omega) = \sum_{m=-\infty}^{+\infty} R[m]e^{-j\omega m},$$

$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S(\omega)e^{+j\omega m}\,d\omega.$$

We rewrite the latter integral as a contour integral around the unit circle in a complex plane
by defining the function of a complex variable, $\mathsf{S}(z)|_{z=e^{j\omega}} \triangleq S(\omega)$, and then substituting $z$
for $e^{j\omega}$ into this new function to obtain the psd as a $z$-transform,

$$\mathsf{S}(z) = \sum_{m=-\infty}^{+\infty} R[m]z^{-m} \quad \text{and} \quad R[m] = \frac{1}{2\pi j}\oint_C \mathsf{S}(z)z^{m-1}\,dz, \quad \text{where } C = \{|z| = 1\}. \tag{A.3-4}$$

In this case the contour is already closed and it encircles the origin in a counterclockwise
direction, so we can apply Equation A.3-3 directly to obtain

$$R[m] = \sum_{p_i \text{ inside } C} \operatorname{Res}[\mathsf{S}(z)z^{m-1}; z = p_i],$$

where the sum is over the residues at the poles inside the unit circle. This formula is valid
for all values of the integer $m$; however, it is awkward to evaluate for negative $m$ due to the
variable-order pole contributed by $z^{m-1}$ at $z = 0$. Fortunately, a transformation mapping
$z$ to $1/z$ conveniently solves this problem, and we have [A-1]

$$R[m] = \frac{1}{2\pi j}\oint_{C'} \mathsf{S}(z^{-1})z^{-m+1}\,(-z^{-2}\,dz) = \frac{1}{2\pi j}\oint_C \mathsf{S}(z^{-1})z^{-m-1}\,dz,$$

where $C'$ denotes the unit circle traversed clockwise (the image of $C$ under the mapping), and
reversing its direction absorbs the minus sign. This avoids the variable-order pole for $m < 0$.
We thus arrive at the prescription:
For $m \ge 0$

$$R[m] = \sum_{\substack{i:\ \text{poles} \\ \text{inside unit circle}}} \operatorname{Res}[\mathsf{S}(z)z^{m-1}; z = p_i],$$

and for $m < 0$

$$R[m] = \sum_{\substack{i:\ \text{poles} \\ \text{outside unit circle}}} \operatorname{Res}[\mathsf{S}(z^{-1})z^{-m-1}; z = p_i^{-1}].$$


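Example A.3-2
(first-order psd of random sequence) We consider a psd given as

$$S(\omega) = \frac{2(1-\rho^2)}{(1+\rho^2) - 2\rho\cos\omega}, \quad |\omega| \le \pi, \tag{A.3-5}$$

which is plotted in Figure A.3-4.

Figure A.3-4 Plot of psd $S(\omega)$ for a $\rho$ value in $(0,1)$.

Using the identity $\cos\omega = 0.5(\exp j\omega + \exp -j\omega)$, we can make this substitution in
Equation A.3-5 to obtain the function of a complex variable,

$$\mathsf{S}(z)|_{z=e^{j\omega}} = S(\omega) = \frac{2(1-\rho^2)}{(1+\rho^2) - 2\rho\cos\omega} = \frac{2(1-\rho^2)}{(1+\rho^2) - \rho(e^{+j\omega} + e^{-j\omega})}.$$

Then we replace $e^{j\omega}$ by $z$ to obtain the function of $z$,

$$\mathsf{S}(z) = \frac{2(1-\rho^2)}{(1+\rho^2) - \rho(z + z^{-1})} = \frac{-2(\rho^{-1}-\rho)\,z}{(z-\rho)(z-\rho^{-1})}.$$

The $z$-plane pole-zero configuration of this function is shown in Figure A.3-5. The overall
transformation from $S(\omega)$ to $\mathsf{S}(z)$ is thus given by the replacement

$$\cos\omega \to \tfrac{1}{2}(z + z^{-1}). \tag{A.3-6}$$

Figure A.3-5 z-plane.

For $m \ge 0$ we get

$$R[m] = \operatorname{Res}[\mathsf{S}(z)z^{m-1}; z = \rho] = \left.\frac{-2(\rho^{-1}-\rho)\,z^{m}}{z-\rho^{-1}}\right|_{z=\rho} = \frac{-2(\rho^{-1}-\rho)\rho^{m}}{\rho-\rho^{-1}} = 2\rho^{m}.$$

For $m < 0$ we have

$$R[m] = \operatorname{Res}[\mathsf{S}(z^{-1})z^{-m-1}; z = \rho],$$

since $z = \rho^{-1}$ is the one pole outside the unit circle.

Now

$$\mathsf{S}(1/z) = \frac{-2(\rho^{-1}-\rho)\,z}{(z-\rho)(z-\rho^{-1})},$$

which could easily have been foretold from the symmetry evident in Equation A.3-6. Then

$$\operatorname{Res}[\mathsf{S}(z^{-1})z^{-m-1}; z = \rho] = \left.\frac{-2(\rho^{-1}-\rho)\,z^{-m}}{z-\rho^{-1}}\right|_{z=\rho} = \frac{-2(\rho^{-1}-\rho)\rho^{-m}}{\rho-\rho^{-1}} = 2\rho^{-m}.$$

Combining, we get the overall answer

$$R[m] = 2\rho^{|m|}, \quad -\infty < m < +\infty.$$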
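The closed form $R[m] = 2\rho^{|m|}$ can be confirmed by evaluating the inverse transform integral directly. The sketch below is illustrative: $\rho = 0.5$ and the grid size are assumed values, and a uniform rectangle rule is used because it is very accurate for smooth periodic integrands.

```python
import cmath
import math


def psd(w, rho):
    """S(w) = 2(1 - rho^2) / ((1 + rho^2) - 2 rho cos w), Equation A.3-5."""
    return 2.0 * (1.0 - rho**2) / ((1.0 + rho**2) - 2.0 * rho * math.cos(w))


def correlation_numeric(m, rho, n=4096):
    """R[m] = (1/2pi) * integral over [-pi, pi) of S(w) exp(j w m) dw, by an n-point sum."""
    total = 0j
    for k in range(n):
        w = -math.pi + 2.0 * math.pi * k / n
        total += psd(w, rho) * cmath.exp(1j * w * m)
    # (1/2pi) * (2pi/n) * sum reduces to sum / n; the imaginary part cancels by symmetry.
    return (total / n).real


rho = 0.5  # illustrative value in (0, 1)
numeric_R = {m: correlation_numeric(m, rho) for m in range(-3, 4)}
for m, value in numeric_R.items():
    print(m, value, 2.0 * rho ** abs(m))
```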

A.4 MATHEMATICAL INDUCTION


Many proofs in probability are obtained by mathematical induction. Mathematical induction
is a method for obtaining results, especially proving theorems, which are difficult if not
impossible to get by any other method. For example: It is claimed that the set $S$ contains
all the positive integers. How would we verify this? We could show that $1 \in S$, $2 \in S$, $3 \in S$,
etc. But using this procedure would not allow us to finish in finite time. Instead we can use
the general principle of mathematical induction:

Let $\{C_k\}$ be an infinite sequence of propositions, given for all $k \ge 1$. We wish to prove
that these propositions are true for every $k \ge 1$. Instead of proving them one by one, we
rely on the following principle.


(i) If $C_1$ is true,
(ii) and for arbitrary $k \ge 1$, “$C_k$ is true” implies “$C_{k+1}$ is true,”


then $C_k$ holds for all $k \ge 1$.


Thus, we only have to perform the two steps (i and ii), using mathematical induction.
After identifying the indexed set of propositions $\{C_k\}$ for our particular problem, we first
show that $C_1$ is true. Then we try to show the second step is true. We do this by assuming
that $C_k$ is true for an arbitrary value of positive index $k$, and then attempting to show that
this fact implies that proposition $C_{k+1}$ is true. Then we are finished.


Example A.4-1
(mathematical induction) Show that if $0 < a < b$ then $a^n < b^n$ for all positive integers $n$.


Solution We choose the method of induction. The problem statement that $0 < a < b$
gives us directly the proposition $C_1 = \{a < b\}$. We then let $C_k$ be the proposition that
$a^k < b^k$, that is, $C_k = \{a^k < b^k\}$. Now assume that $C_k$ is true, meaning $a^k < b^k$ for
some $k$. It then follows that $a^{k+1} = a \times a^k < a \times b^k < b \times b^k = b^{k+1}$. Thus, $C_{k+1}$ is true. The
principle of mathematical induction then allows us to conclude that all the propositions $C_k$
are true, that is, $a^k < b^k$, for all positive integers $k$.

Functions


B.1 GAMMA FUNCTION


The Gamma function $\Gamma(a)$, for real $a$, is defined by the integral [1,2]

$$\Gamma(a) \triangleq \int_0^{\infty} t^{a-1}e^{-t}\,dt, \tag{B.1-1}$$

where $a > 0$. From Equation B.1-1 we see that $\Gamma(1) = 1$. If we integrate $\Gamma(a + 1)$ by parts
we obtain

$$\Gamma(a+1) = \int_0^{\infty} t^{a}e^{-t}\,dt = -t^{a}e^{-t}\Big|_0^{\infty} + a\int_0^{\infty} t^{a-1}e^{-t}\,dt = a\Gamma(a).$$

Hence, for positive integer $k$,

$$\Gamma(k) = (k-1)!$$

For values of the argument between the integers, the gamma function does a smooth
interpolation. It is available in MATLAB as the function gamma.

Since $\Gamma(1) = 1$, note that $0! = 1$. We leave it to the reader to show that $\Gamma(0.5) = \sqrt{\pi}$
and $\Gamma(1.5) = \sqrt{\pi}/2$. The Gamma function is sometimes called the generalized factorial
function.
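These properties are easy to spot-check with the gamma function in Python's standard `math` module (used here in place of the MATLAB function mentioned above). The argument $a = 2.7$ is an arbitrary illustrative value.

```python
import math

# Gamma(k) = (k - 1)! for positive integers k.
for k in range(1, 8):
    print(k, math.gamma(k), math.factorial(k - 1))

# Recurrence Gamma(a + 1) = a * Gamma(a) at a non-integer argument (illustrative value).
a = 2.7
recurrence_gap = abs(math.gamma(a + 1) - a * math.gamma(a))
print(recurrence_gap)

# Half-integer values: Gamma(0.5) = sqrt(pi) and Gamma(1.5) = sqrt(pi) / 2.
print(math.gamma(0.5), math.sqrt(math.pi))
print(math.gamma(1.5), math.sqrt(math.pi) / 2)
```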



B.2 INCOMPLETE GAMMA FUNCTION


The (upper) incomplete Gamma function $\Gamma(a, x)$ is defined by the integral

$$\Gamma(a,x) \triangleq \int_x^{\infty} t^{a-1}e^{-t}\,dt,$$

where $a > 0$. The (lower) incomplete Gamma function is defined by

$$\gamma(a,x) \triangleq \int_0^{x} t^{a-1}e^{-t}\,dt.$$

Unless stated otherwise, incomplete Gamma function will mean the upper incomplete Gamma
function. Clearly $\Gamma(a,0) = \Gamma(a)$. For $a = k$ an integer, the incomplete Gamma function is
known to satisfy the series [8, 4]

$$\Gamma(k,x) = (k-1)!\,e^{-x}\sum_{l=0}^{k-1}\frac{x^{l}}{l!},$$

which can also be written as

$$\Gamma(k,x) = (k-1)\Gamma(k-1,x) + x^{k-1}e^{-x},$$

and it is available in MATLAB as the function gammainc. This function plays a crucial role
in evaluating the distribution function of the Poisson random variable.
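The series, the recursion, and the link to the Poisson distribution function can be checked with a few lines of code. This is a sketch under stated assumptions: the helper names are hypothetical, the integral is approximated by a midpoint rule truncated at $t = 60$, and the Poisson relation used is the standard identity $P[X \le k] = \Gamma(k+1, \mu)/k!$ for a Poisson random variable with mean $\mu$.

```python
import math


def upper_incomplete_gamma_series(k, x):
    """Gamma(k, x) = (k-1)! e^{-x} sum_{l=0}^{k-1} x^l / l! for a positive integer k."""
    return math.factorial(k - 1) * math.exp(-x) * sum(
        x**l / math.factorial(l) for l in range(k)
    )


def upper_incomplete_gamma_numeric(a, x, t_max=60.0, n=200_000):
    """Midpoint-rule approximation of the integral of t^(a-1) e^{-t} from x to t_max."""
    dt = (t_max - x) / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * dt
        total += t ** (a - 1) * math.exp(-t)
    return total * dt


def poisson_cdf(k, mu):
    """P[X <= k] for a Poisson RV with mean mu, written as Gamma(k + 1, mu) / k!."""
    return upper_incomplete_gamma_series(k + 1, mu) / math.factorial(k)


k, x = 4, 2.5  # illustrative values
series_value = upper_incomplete_gamma_series(k, x)
numeric_value = upper_incomplete_gamma_numeric(k, x)
recursion_value = (k - 1) * upper_incomplete_gamma_series(k - 1, x) + x ** (k - 1) * math.exp(-x)
direct_poisson = sum(math.exp(-2.0) * 2.0**j / math.factorial(j) for j in range(4))

print(series_value, numeric_value, recursion_value)
print(poisson_cdf(3, 2.0), direct_poisson)
```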


B.3 DIRAC DELTA FUNCTION 


The Dirac delta function $\delta(x)$ is often defined as a “function” that is zero everywhere except
at $x = 0$, where it is infinite such that

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1.$$

The mild controversy about regarding $\delta(x)$ as a “function” in the ordinary sense is partly due
to it not being of bounded variation and not having bounded energy in any finite-length
support that contains it. Another definition is to regard $\delta(x)$ as the limit of one of several
pulses. For example, with rectangular window,

$$w\!\left(\frac{x}{b}\right) \triangleq \begin{cases} 1, & -b/2 \le x \le b/2, \\ 0, & \text{else}, \end{cases}$$

we can define $\delta(x)$ as

$$\delta(x) \triangleq \lim_{a\to\infty}\{a\,w(ax)\}.$$

Figure B.3-1 Rectangular and Gaussian-shaped pulses of unit area.

Another possibility is to define $\delta(x)$ as

$$\delta(x) \triangleq \lim_{a\to\infty}\{a\exp(-\pi a^2 x^2)\}.$$

The rectangular and Gaussian shaped pulses are shown in Figure B.3-1. The function
$a\,w(ax)$ has discontinuous derivatives, whereas $a\exp(-\pi a^2 x^2)$ has continuous derivatives.
The exact shape of these functions is immaterial. Their important features are (1) unit area
and (2) rapid decrease to zero for $x \neq 0$.

Still another definition is to call any object a delta function if for any function $f(\cdot)$
continuous at $x$ it satisfies the integral equation†

$$\int_{-\infty}^{\infty} f(y)\,\delta(y - x)\,dy = f(x). \tag{B.3-1}$$

This definition can, of course, be related to the previous one, since either of the pulses
when substituted for $\delta(x)$ in Equation B.3-1 will essentially furnish the same result when
$a$ is large. This follows because the integrand is significantly nonzero only for $y \approx x$. The
integral can, therefore, be approximately evaluated by replacing $f(y)$ by $f(x)$ and moving
it outside the integral. Then, since both pulses have unit area, the result follows. Note that
$\delta(x) = \delta(-x)$.
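The limiting behavior behind Equation B.3-1 can be seen numerically by substituting the two unit-area pulses for the delta function and letting $a$ grow. The sketch below is only illustrative: $f = \cos$, the point $x = 0.3$, the pulse parameters $a = 10$ and $a = 100$, and the integration window are all assumed values.

```python
import math


def rect_pulse(x, a):
    """a * w(a x): height a for |x| <= 1/(2a), zero elsewhere (unit area)."""
    return a if abs(x) <= 0.5 / a else 0.0


def gauss_pulse(x, a):
    """a * exp(-pi a^2 x^2): Gaussian-shaped pulse of unit area."""
    return a * math.exp(-math.pi * (a * x) ** 2)


def sift(pulse, f, x, a, half_width=1.0, n=400_000):
    """Midpoint-rule approximation of the integral of f(y) * pulse(y - x) dy.

    The integration window is [x - half_width, x + half_width]; both pulses are
    negligible outside it for the values of a used here.
    """
    dy = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        y = x - half_width + (i + 0.5) * dy
        total += f(y) * pulse(y - x, a)
    return total * dy


x0 = 0.3
target = math.cos(x0)
errors = {
    (name, a): abs(sift(pulse, math.cos, x0, a) - target)
    for name, pulse in (("rect", rect_pulse), ("gauss", gauss_pulse))
    for a in (10.0, 100.0)
}
for key, err in errors.items():
    print(key, err)  # the error shrinks as a increases
```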

Consider now the unit step $u(x - x_i)$, which is discontinuous at $x = x_i$ with $u(0) \triangleq 1$
(Figure B.3-2a). The discontinuity can be viewed as the limit of the function shown in
Figure B.3-2b. The derivative is shown in Figure B.3-2c.

The derivative of the function shown in Figure B.3-2b is given, in the limit, by

$$\left.\frac{dF}{dx}\right|_{x_i} \triangleq \frac{dF(x_i)}{dx_i} = \lim_{\Delta x \to 0}\frac{1}{\Delta x}\,w\!\left(\frac{x - x_i}{\Delta x}\right) = \delta(x - x_i). \tag{B.3-2}$$


†A word of caution is in order here. Since $\delta(x)$ is zero everywhere except at a single point, its integral
(in the Riemann sense) is not defined. Hence, Equation B.3-1 is essentially symbolic, that is, it implies a
limiting operation as was done with the rectangular and Gaussian pulses.

Figure B.3-2 (a) unit step $u(x - x_i)$; (b) approximation to unit step; (c) derivative of function in (b).


Thus, formally, the derivative at a step discontinuity is a delta function with weight‡
proportional to the height of the jump. It is not uncommon to call $\delta(x - x_i)$ the delta
function at “$x_i$.”

Returning now to Equation 2.5-7 in Chapter 2, which can be written as

$$F(x) = \sum_i P_X(x_i)\,u(x - x_i),$$

and using the result of Equation B.3-2 enables us to write for a discrete RV:

$$f_X(x) = \frac{dF(x)}{dx} = \sum_i P_X(x_i)\,\delta(x - x_i), \tag{B.3-3}$$

where we recall that $P_X(x_i) = F(x_i) - F(x_i^-)$ and the unit step assures that the summation
is over all $i$ such that $x_i \le x$.


‡It is also called the area of the delta function.

APPENDIX C
Functional Transformations and Jacobians


C.1 INTRODUCTION


Functional transformations play an important role in probability theory as well as many
other fields. In this appendix, we shall review the theory of Jacobians, beginning with a
two-function-to-two-function transformation and extending the result to the
$n$-function-to-$n$-function case. First, we should recall two basic results from advanced calculus:


Theorem C.1-1 Consider a bounded linear transformation $L$ from $E^n$ to $E^n$. If $D$ is
a bounded set in $E^n$ with $n$-dimensional volume $V(D)$, then the volume of $L(D)$ is merely
$k \times V(D)$, where $k$ is a constant independent of $D$.


Theorem C.1-2 If $T$ is a transformation of class $C^1$ from $E^n$ to $E^n$ in an open set $D$
then, at every point $p \in D$, $dT$ is a linear transformation from $E^n$ to $E^n$.


The first theorem states that the effect of $L$ is merely to multiply the volume by a
constant that doesn’t depend on the shape of $D$. The second theorem states that, at the
differential level, even nonlinear transformations become linear, provided that the
transformations consist of differentiable functions. Both theorems will find application in this
development.

Figure C.2-1 Left panel: infinitesimal rectangle. Right panel: the infinitesimal rectangle mapped into an infinitesimal parallelogram.


C.2 JACOBIANS FOR n = 2 


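Consider the pair of one-to-one† differentiable functions $v = g(x,y)$, $w = h(x,y)$ with
the unique inverse $x = \phi(v,w)$, $y = \psi(v,w)$. As the vector $\mathbf{z} = (v,w)$ traces out the
infinitesimal rectangle $\mathcal{R}$ in the $v'$-$w'$ plane, the vector $\mathbf{u} = (x,y)$ traces out the infinitesimal
parallelogram $\mathcal{S}$ in the $x'$-$y'$ plane. By Theorem C.1-2, this differential transformation is
linear, and by Theorem C.1-1, the ratio of the areas, $A(\mathcal{S})/A(\mathcal{R})$, is a constant. We shall
denote this constant by $|\tilde{J}|$ and compute its value.

†This means that every point $(x,y)$ maps into a unique $(v,w)$ and vice versa.

We can compute the constant $|\tilde{J}|$ with the aid of Figure C.2-1. Recalling that $x = \phi(v,w)$,
$y = \psi(v,w)$, we compute the image points $\tilde{P}_1, \tilde{P}_2, \tilde{P}_3$ of the vertices at $P_1, P_2, P_3$ as:

$$\tilde{P}_1 = (x, y), \quad \tilde{P}_2 = \left(x + \frac{\partial\phi}{\partial v}dv,\ y + \frac{\partial\psi}{\partial v}dv\right), \quad \tilde{P}_3 = \left(x + \frac{\partial\phi}{\partial w}dw,\ y + \frac{\partial\psi}{\partial w}dw\right).$$

These results are directly obtained by a Taylor series expansion about $(x,y)$. Thus, for
example, the coordinates $(x_2, y_2)$ of $\tilde{P}_2$ are obtained from

$$x_2 = \phi(v + dv, w) \approx \phi(v,w) + \frac{\partial\phi}{\partial v}dv \quad \text{and} \quad y_2 = \psi(v + dv, w) \approx \psi(v,w) + \frac{\partial\psi}{\partial v}dv.$$

There are no terms involving derivatives with respect to $w$ because $w$ is held constant in going
from $P_1$ to $P_2$. A result from vector analysis is that the area of a parallelogram spanned
by the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ is given by the magnitude of the cross-product, that is,

$$A(\mathcal{S}) = |\mathbf{v}_1 \times \mathbf{v}_2| = \left|\left(\frac{\partial\phi}{\partial v}\mathbf{i}\,dv + \frac{\partial\psi}{\partial v}\mathbf{j}\,dv\right) \times \left(\frac{\partial\phi}{\partial w}\mathbf{i}\,dw + \frac{\partial\psi}{\partial w}\mathbf{j}\,dw\right)\right|,$$

where we used the fact that $\mathbf{v}_1 = \tilde{P}_2 - \tilde{P}_1$ and $\mathbf{v}_2 = \tilde{P}_3 - \tilde{P}_1$. The unit vectors $\mathbf{i}, \mathbf{j}$ satisfy
$\mathbf{i} \times \mathbf{j} = \mathbf{k}$, $\mathbf{j} \times \mathbf{i} = -\mathbf{k}$, $\mathbf{i} \times \mathbf{i} = \mathbf{j} \times \mathbf{j} = \mathbf{0}$, where $\mathbf{k} \perp \mathbf{i}, \mathbf{j}$ and points out of the plane of the
paper. Thus,

$$A(\mathcal{S}) = \left|\frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}\right| dv\,dw.$$

Since $A(\mathcal{R}) = dv\,dw$, we find that the ratio of the areas is

$$\frac{A(\mathcal{S})}{A(\mathcal{R})} = \left|\frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}\right| = |\tilde{J}|.$$

In higher dimensions it is easier to write $\tilde{J}$ as a determinant. Indeed, even in this
two-dimensional case, we can write:

$$\tilde{J} = \begin{vmatrix} \dfrac{\partial\phi}{\partial v} & \dfrac{\partial\phi}{\partial w} \\[2mm] \dfrac{\partial\psi}{\partial v} & \dfrac{\partial\psi}{\partial w} \end{vmatrix} = \frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}.$$

The quantity $\tilde{J}$ is called the Jacobian of the transformation $x = \phi(v,w)$, $y = \psi(v,w)$.
Among other things, the Jacobian is necessary to preserve probability measure (sometimes
called the probability mass or probability volume). For example, consider a pdf
$f_{XY}(x,y)$ and the transformation $x = \phi(v,w)$, $y = \psi(v,w)$. Consider the event
$B \triangleq \{\zeta : (X,Y) \in \mathcal{S} \subset E^2\}$. Then

$$P(B) = \iint_{\mathcal{S}} f_{XY}(x,y)\,dx\,dy \neq \iint_{\mathcal{R}} f_{XY}(\phi(v,w), \psi(v,w))\,dv\,dw$$

because the volume $dx\,dy \neq dv\,dw$. What is needed is the Jacobian to create the equality
among the integrals as

$$\iint_{\mathcal{S}} f_{XY}(x,y)\,dx\,dy = \iint_{\mathcal{R}} f_{XY}(\phi(v,w), \psi(v,w))\,|\tilde{J}|\,dv\,dw.$$

Sometimes it may be easier to deal with the original functions $v = g(x,y)$, $w = h(x,y)$
than the inverse functions $x = \phi(v,w)$, $y = \psi(v,w)$. To get the desired result, we recompute
the ratio of areas by considering the image, $\mathcal{R}'$, in the $v$-$w$ system, of an infinitesimal
rectangle, $\mathcal{S}'$, in the $x$-$y$ system (Figure C.2-2). Following the same procedure as before, we
obtain $A(\mathcal{S}')/A(\mathcal{R}') = 1/|J|$, where the primes help indicate the regions in the two systems
and $J$ is given by

$$J = \begin{vmatrix} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \\[2mm] \dfrac{\partial h}{\partial x} & \dfrac{\partial h}{\partial y} \end{vmatrix}.$$

But, by Theorem C.1-1, $A(\mathcal{S}')/A(\mathcal{R}') = A(\mathcal{S})/A(\mathcal{R})$ and, hence, $|\tilde{J}| = 1/|J|$ or $|J\tilde{J}| = 1$.
We leave the details of the computation as an exercise for the reader.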
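A concrete case makes the relation between the two Jacobians tangible. The sketch below uses the polar-coordinate transformation as an illustrative assumption (it is not an example from the text): $x = \phi(v,w) = v\cos w$, $y = \psi(v,w) = v\sin w$, with forward functions $v = g(x,y) = \sqrt{x^2+y^2}$ and $w = h(x,y) = \operatorname{atan2}(y,x)$. Both determinants are estimated by central differences, and the area ratio is computed from the cross product of the edge vectors of the mapped rectangle.

```python
import math


def phi(v, w):
    """x = v cos w (inverse transformation, illustrative polar example)."""
    return v * math.cos(w)


def psi(v, w):
    """y = v sin w."""
    return v * math.sin(w)


def g(x, y):
    """v = sqrt(x^2 + y^2) (forward transformation)."""
    return math.hypot(x, y)


def h(x, y):
    """w = atan2(y, x)."""
    return math.atan2(y, x)


def jacobian_2x2(f1, f2, p, q, eps=1e-6):
    """Determinant of [[df1/dp, df1/dq], [df2/dp, df2/dq]] by central differences."""
    d11 = (f1(p + eps, q) - f1(p - eps, q)) / (2 * eps)
    d12 = (f1(p, q + eps) - f1(p, q - eps)) / (2 * eps)
    d21 = (f2(p + eps, q) - f2(p - eps, q)) / (2 * eps)
    d22 = (f2(p, q + eps) - f2(p, q - eps)) / (2 * eps)
    return d11 * d22 - d12 * d21


v0, w0 = 2.0, 0.7
x0, y0 = phi(v0, w0), psi(v0, w0)

J_tilde = jacobian_2x2(phi, psi, v0, w0)  # Jacobian of (x, y) with respect to (v, w)
J = jacobian_2x2(g, h, x0, y0)            # Jacobian of (v, w) with respect to (x, y)

# Area ratio from the images of three vertices of a small rectangle of sides dv, dw.
dv = dw = 1e-4
p1 = (phi(v0, w0), psi(v0, w0))
p2 = (phi(v0 + dv, w0), psi(v0 + dv, w0))
p3 = (phi(v0, w0 + dw), psi(v0, w0 + dw))
edge1 = (p2[0] - p1[0], p2[1] - p1[1])
edge2 = (p3[0] - p1[0], p3[1] - p1[1])
area_ratio = abs(edge1[0] * edge2[1] - edge1[1] * edge2[0]) / (dv * dw)

print(J_tilde, J, J * J_tilde, area_ratio)
```

For this transformation the Jacobian of the inverse functions equals $v$, so the first and last printed values should both be close to 2, and the product of the two Jacobians should be close to 1.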

Figure C.2-2 Left panel: infinitesimal parallelogram. Right panel: the infinitesimal parallelogram is mapped into an infinitesimal rectangle.


C.3 JACOBIAN FOR GENERAL n 


The general case is easier to deal with if we allow ourselves to use matrix and vector notation 
and some results from linear algebra. First, it is not convenient to use the unit vectors i, j, k 
in higher dimensions. Instead, we use unit vectors that are represented by column vectors. 
Thus, in E? we use e; = [1,0]? and e2 = [0,1]7. Then 


_ a6 dp, _[d¢, de, 17 
Vi= ay ey + aye eg = soa aut 
and 
_ 06 dp, _[d¢, O,)" 
Vo= 7p tw el + Aap oe 2 = seaw Aw te 


Next, we form the 2 x 2 matrix V2 = [vi vg], where the subscript 2 on V2 refers to 
two-dimensional Euclidean space. 

Then, for the special case of n = 2, A(S%) is given by | det V2|. As we go to higher 
dimensions we drop the term “area of the parallelepiped” in favor of “volume of the paral- 
lelepiped,” although purists would argue that for spaces of dimensions higher than three we 
should use “hypervolume.” Also in higher dimensions, it is easier to use different subscripts 
rather than different symbols for functions and arguments. In n-dimensional space, the 
volume of a parallelepiped is always given by the height times the base area, where the base 
area is the volume of the parallelepiped in n — 1 dimensional space and the height is the 
length of the component of v,, which is orthogonal to the vectors that span E”~!. Thus, 
in E? the base area is the length of the chosen base vector and the height is the length 
of the orthogonal component of the second vector. In E%, the base area is the area of 
the parallelogram spanned by any two of the three vectors and the height is the length 
of the component of the third vector orthogonal to the plane containing the first two 
vectors. 
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We wish to compute the volume of an infinitesimal parallelepiped in n-dimensional 
space. Motivated by the fact that the volume, V2, in two-dimensional space is given by 
V2 = |det V2|, we are tempted to write that V,, = | det V,,|. Is this true? The answer is yes 
and the proof is furnished by induction. Thus, we assume that V,, = | det V,,| is true and 
we must prove that V,11 = | det V,,.,|. Now in terms of the vectors v1,V2,°°+ ,Vn,Vn+1; 
the matrix V,,41 can be written as 


Vivre Vn} Unt+1,1 


ace <a ‘ Un+1,2 
Vint = . . . . 


0 0 O tata 


Vn Un+1,1 


0 yore 0 ere rere, 


To compute $|\det \mathbf{V}_{n+1}|$ we expand by the bottom row to obtain $|\det \mathbf{V}_{n+1}| = |v_{n+1,n+1}|\,|\det \mathbf{V}_n|$, 
since all other terms in the expansion are zero. Now consider the vector 
$\mathbf{v}_{n+1}$ in more detail. In terms of the unit vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{n+1}$, it can be written as 

$$
\mathbf{v}_{n+1} = v_{n+1,n+1}\,\mathbf{e}_{n+1} + \sum_{i=1}^{n} v_{n+1,i}\,\mathbf{e}_i,
$$

where $\mathbf{e}_i$ has a 1 in the ith position (row) and 0's in the remaining n positions. But $\mathbf{e}_{n+1}$ 
is the unit vector orthogonal to $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$, and hence is orthogonal to the space 
spanned by them, and $|v_{n+1,n+1}|$ is the height of $\mathbf{v}_{n+1}$ above that space. Also recall that $|\det \mathbf{V}_n|$ is the volume of the 
parallelepiped in n dimensions and therefore represents the base area in n + 1 dimensions. 
Hence $|\det \mathbf{V}_{n+1}| = |v_{n+1,n+1}|\,|\det \mathbf{V}_n|$ is indeed height times base area and the proof is 
complete. 
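
The height-times-base argument can be checked numerically. The following is a minimal sketch, not part of the original proof: it compares the absolute determinant with the product of successive heights obtained by Gram-Schmidt orthogonalization. The function names `det` and `volume_by_heights` and the three example vectors are illustrative assumptions.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    sign, result = 1.0, 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result

def volume_by_heights(vectors):
    """Multiply the heights: each vector's component orthogonal to the span of the earlier ones."""
    basis = []      # orthonormal basis of the span built so far
    volume = 1.0
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        height = math.sqrt(sum(wi * wi for wi in w))
        volume *= height
        if height > 1e-12:
            basis.append([wi / height for wi in w])
    return volume

# Hypothetical edge vectors of a parallelepiped in three dimensions.
vectors = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [1.0, 1.0, 4.0]]
print(abs(det(vectors)), volume_by_heights(vectors))
```

The determinant of a matrix equals that of its transpose, so it does not matter whether the vectors are stored as rows or as columns.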

Readers familiar with Hadamard’s inequality and the Gram—Schmidt orthogonalization 
procedure can furnish a faster, more direct, proof that avoids induction, but is less intuitive. 


Example C.3-1 
In Chapter 5 we considered the transformation 


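$$
\begin{aligned}
y_1 &= g_1(x_1, x_2, \ldots, x_n) \\
y_2 &= g_2(x_1, x_2, \ldots, x_n) \\
&\;\;\vdots \\
y_n &= g_n(x_1, x_2, \ldots, x_n),
\end{aligned}
$$

with unique inverse 

$$
\begin{aligned}
x_1 &= \phi_1(y_1, y_2, \ldots, y_n) \\
x_2 &= \phi_2(y_1, y_2, \ldots, y_n) \\
&\;\;\vdots \\
x_n &= \phi_n(y_1, y_2, \ldots, y_n).
\end{aligned}
$$

Then, a rectangular parallelepiped in the $(y_1, y_2, \ldots, y_n)$ system with volume $\prod_{i=1}^{n} |dy_i|$ 
maps into a parallelepiped in the $(x_1, x_2, \ldots, x_n)$ system with volume $|\det \mathbf{V}_n| = |\det[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n]|$. 
Here, by computing the differentials of the transformation, we obtain 
for the $\mathbf{v}_i$, $i = 1, \ldots, n$: 

$$
\mathbf{v}_i = \left( \frac{\partial \phi_1}{\partial y_i}\,dy_i,\; \frac{\partial \phi_2}{\partial y_i}\,dy_i,\; \ldots,\; \frac{\partial \phi_n}{\partial y_i}\,dy_i \right)^T .
$$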

APPENDIX D 
Measure and Probability 


D.1 INTRODUCTION AND BASIC IDEAS 


Some mathematicians describe probability theory as a special case of measure theory. 
Indeed, random variables are said to be measurable functions; the distribution function 
is said to be a measure; events are measurable sets; the sample description space together 
with the field of events is a measurable space; and a probability space is a measure space. 
In this appendix, we furnish some results for readers not familiar with the basic ideas of 
measure theory. We assume that the reader has read Chapter 1 and is familiar with set 
operations, fields, and sigma fields. The bulk of the material in this appendix is adapted 
from the classic work by Billingsley.† 

Let $\Omega$ be a space (a universal set) and let A, B, C, ... be elements (subsets) of $\Omega$. Also, 
as in the text, let $\phi$ denote the empty set. Let $\mathcal{F}$ be a field of sets on $\Omega$. Then the pair 
$(\Omega, \mathcal{F})$ is a measurable space if $\mathcal{F}$ is a σ-field on $\Omega$. Let $\mu$ be a set function‡ on $\mathcal{F}$. Then $\mu$ 
is a measure if it satisfies these conditions: 


(i) if $A \in \mathcal{F}$, then $\mu[A] \in [0, \infty]$; 
(ii) $\mu[\phi] = 0$; 
(iii) if $A_1, A_2, \ldots$ is a disjoint sequence of sets in $\mathcal{F}$ and if $\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$, then 

$$
\mu\left[\bigcup_{k=1}^{\infty} A_k\right] = \sum_{k=1}^{\infty} \mu[A_k].
$$

†Patrick Billingsley, Probability and Measure. New York: John Wiley & Sons, 1978. 
‡A set function is a real-valued function defined on the field $\mathcal{F}$ of subsets of the space $\Omega$. 


This property is called countable additivity. A measure $\mu$ is called finite if $\mu[\Omega] < \infty$; 
it is infinite if $\mu[\Omega] = \infty$. It qualifies as a probability measure if $\mu[\Omega] = 1$, as denoted in 
Chapter 1. If $\mathcal{F}$ is a σ-field in $\Omega$, the triplet $(\Omega, \mathcal{F}, \mu)$ is a measure space. 

Countable additivity implies finite additivity, that is, 

$$
\mu\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{k=1}^{n} \mu[A_k]
$$

if the sets are disjoint. A measure $\mu$ is monotone, that is, $\mu[A] \le \mu[B]$ whenever $A \subset B$. 
The proof of this statement is straightforward. Write, as is customary in the literature on 
measure theory, $BA^c \triangleq B - A$ and $B = (B - A) \cup AB = (B - A) \cup A$. Then, since A 
and B − A are disjoint, it follows that $\mu[B] = \mu[B - A] + \mu[A] \ge \mu[A]$. Also, since $A \cup B = 
(A - B) \cup (B - A) \cup AB$, it follows that $\mu[A \cup B] = \mu[A - B] + \mu[B - A] + \mu[AB]$. This 
result can be extended to many sets in a σ-field (sets in a σ-field are called σ-sets), that is, 

$$
\mu\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{k=1}^{n} \mu[A_k] - \sum_{i<j} \mu[A_i A_j] + \cdots + (-1)^{n+1} \mu[A_1 A_2 \cdots A_n].
$$

Of course, this equation makes sense only if the sets have finite measure. It is also 
straightforward to show that $\mu[\cdot]$ has the property of subadditivity: 

$$
\mu\left[\bigcup_{k=1}^{n} A_k\right] \le \sum_{k=1}^{n} \mu[A_k].
$$
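
As a concrete check of the inclusion-exclusion expansion and of subadditivity, the sketch below uses the counting measure (the number of elements of a finite set), which is one simple measure satisfying conditions (i) to (iii). The three sets and the function names are illustrative assumptions.

```python
from itertools import combinations

def counting_measure(s):
    """mu[A] = number of elements of A, a measure on subsets of a finite space."""
    return len(s)

def inclusion_exclusion(sets):
    """Alternating sum over all r-fold intersections, r = 1, ..., n."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(sets, r):
            total += sign * counting_measure(frozenset.intersection(*group))
    return total

# Hypothetical overlapping sets A_1, A_2, A_3.
A = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5}), frozenset({4, 5, 6, 7})]
union = frozenset().union(*A)

measure_of_union = counting_measure(union)              # left-hand side
expansion = inclusion_exclusion(A)                      # right-hand side
sum_of_measures = sum(counting_measure(s) for s in A)   # subadditivity bound
print(measure_of_union, expansion, sum_of_measures)
```

Because the sets overlap, the plain sum of the measures exceeds the measure of the union; the alternating correction terms remove the multiply counted elements.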


Example D.1-1 
Lebesgue measure. Consider the σ-field, $\mathcal{F}$, of intervals on $\Omega = (0, 1)$. The elements of $\mathcal{F}$ are 
called linear Borel sets and the σ-field of intervals is called the Borel field $\mathcal{B}$. We shall use 
this notation for any σ-field on the real line. A measure $\mu[\cdot]$ on $\mathcal{B}$ is $\mu = \lambda(a, b] \triangleq b - a$, where 
b > a. This measure is called the Lebesgue measure on (a, b]. It can be directly generalized 
to the real line $R^1$. An extension of the Lebesgue measure to k-dimensional Euclidean 
space is: 

$$
\mu = \lambda_k[\mathbf{x} : a_i < x_i \le b_i,\; i = 1, \ldots, k] \triangleq \prod_{i=1}^{k} (b_i - a_i).
$$

Thus, the Lebesgue measures are length (k = 1), area (k = 2), volume (k = 3), and 
hypervolume (k > 3). We denote the associated σ-field generated by these generalized rectangles 
by the symbol $\mathcal{B}^k$. 
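
The product formula for generalized rectangles is easy to evaluate directly. The sketch below is only an illustration under stated assumptions: the helper name `lebesgue_rect` and the example rectangles are not from the text. It computes length, area, and volume, and checks additivity when a rectangle is split into two disjoint pieces.

```python
from math import prod

def lebesgue_rect(a, b):
    """Lebesgue measure of {x : a_i < x_i <= b_i, i = 1, ..., k}."""
    if len(a) != len(b):
        raise ValueError("corner vectors must have the same dimension")
    # An empty side (b_i <= a_i) makes the whole rectangle empty.
    return prod(max(bi - ai, 0.0) for ai, bi in zip(a, b))

length = lebesgue_rect([0.0], [0.5])                        # k = 1
area = lebesgue_rect([0.0, 0.0], [2.0, 3.0])                # k = 2
volume = lebesgue_rect([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])    # k = 3

# Split the k = 2 rectangle along x_1 = 0.5 into two disjoint rectangles.
left = lebesgue_rect([0.0, 0.0], [0.5, 3.0])
right = lebesgue_rect([0.5, 0.0], [2.0, 3.0])
print(length, area, volume, left + right)
```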
There are many important theorems regarding measures. We cite several below. 


Theorem D.1-1 (Translation invariance.) Let $A \in \mathcal{B}^k$ and define $A + \mathbf{x} \triangleq \{\mathbf{a} + \mathbf{x} : 
\mathbf{a} \in A\}$. Then $\lambda_k(A + \mathbf{x}) = \lambda_k(A)$ for all translation vectors $\mathbf{x}$. ■ 


Theorem D.1-2 (Lebesgue measure of transformation.) Let $T: R^k \to R^k$ denote a 
linear and nonsingular transformation from the Euclidean space $R^k$ to $R^k$. Then $A \in \mathcal{B}^k$ 
implies that $TA \in \mathcal{B}^k$ and $\lambda_k(TA) = |\det T| \cdot \lambda_k(A)$. For example, if T is a rotation 
or reflection, that is, an orthogonal or unitary transformation, then $|\det T| = 1$ and 
$\lambda_k(TA) = \lambda_k(A)$. ■ 


Theorem D.1-3 (Lebesgue measure of subspaces of $R^k$.) Every (k − 1)-dimensional 
hyperplane has k-dimensional Lebesgue measure zero. ■ 


Theorem D.1-4 (Continuity of measure.) (i) Let $\mu$ be a measure on a field $\mathcal{F}$. Then 
if $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \uparrow A$, then $\mu[A_n] \uparrow \mu[A]$. This is called continuity of measure 
from below. $A_n \uparrow A$ means that $A_{n-1} \subset A_n \subset A_{n+1} \subset \cdots$ and 

$$
A = \bigcup_{n=1}^{\infty} A_n .
$$

Likewise, $\mu[A_n] \uparrow \mu[A]$ means that $\mu[A_n] \le \mu[A_{n+1}] \le \mu[A]$ and $\lim \mu[A_n] = \mu[A]$. 


(ii) Let $\mu$ be a measure on a field $\mathcal{F}$. Then if $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \downarrow A$, then 
$\mu[A_n] \downarrow \mu[A]$, provided that $\mu[A_1]$ is finite. This is called continuity of measure from above. $A_n \downarrow A$ means that $A_{n-1} \supset 
A_n \supset A_{n+1} \supset \cdots$ and 

$$
A = \bigcap_{n=1}^{\infty} A_n .
$$

Likewise, $\mu[A_n] \downarrow \mu[A]$ means that $\mu[A_n] \ge \mu[A_{n+1}] \ge \mu[A]$ and $\lim \mu[A_n] = \mu[A]$. 
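
Continuity of measure can be illustrated with Lebesgue measure on intervals. In the sketch below, the particular sequences are illustrative choices, not taken from the text: the intervals (0, 1 − 1/n] increase to (0, 1), the intervals (0, 1/n] decrease to the empty set, and the half-lines (n, ∞) also decrease to the empty set but all have infinite measure. The last case shows why continuity from above requires a set of finite measure in the sequence.

```python
import math

def interval_length(a, b):
    """Lebesgue measure of the interval (a, b]; zero if the interval is empty."""
    return max(b - a, 0.0)

indices = [1, 10, 100, 1000, 10**6]

# A_n = (0, 1 - 1/n] increases to A = (0, 1), which has measure 1.
from_below = [interval_length(0.0, 1.0 - 1.0 / n) for n in indices]

# A_n = (0, 1/n] decreases to the empty set, which has measure 0.
from_above = [interval_length(0.0, 1.0 / n) for n in indices]

# A_n = (n, infinity) decreases to the empty set, yet every A_n has infinite measure.
half_lines = [interval_length(float(n), math.inf) for n in indices]

print(from_below)
print(from_above)
print(half_lines)
```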


Measurable Mappings and Functions 


Let $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ be two measurable spaces with two sets $A \in \mathcal{F}$ and $A' \in \mathcal{F}'$. For a 
mapping $T: \Omega \to \Omega'$, consider the inverse image $T^{-1}A' = \{\omega \in \Omega : T\omega \in A'\}$ for $A' \subset \Omega'$. 
The mapping is measurable if $T^{-1}A' \in \mathcal{F}$ for every $A' \in \mathcal{F}'$. For example, consider the unit 
interval $\Omega = (0, 1)$ with $\mathcal{F} = \mathcal{B}$ and the mapping $Tx = x^2$. Here, $\Omega' = \Omega$ and $\mathcal{F}' = \mathcal{F}$. 
Clearly, the inverse image of every Borel interval in $\Omega'$ is a Borel interval in $\Omega$. Hence, T is 
a measurable mapping. 

A real function X on $\Omega$, with image space $R^1$, is said to be measurable if its inverse 
image $X^{-1}B = \{\omega : X(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}$. 


D.2 APPLICATION OF MEASURE THEORY TO PROBABILITY 


A set function P on a σ-field $\mathcal{F}$ is a probability measure if: 


(i) $0 \le P[A] \le 1$ for every $A \in \mathcal{F}$; 
(ii) $P[\phi] = 0$, $P[\Omega] = 1$; 
(iii) if $A_1, A_2, \ldots, A_n, \ldots$ is a disjoint sequence of $\mathcal{F}$-sets such that 
$\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$, 
then 

$$
P\left[\bigcup_{k=1}^{\infty} A_k\right] = \sum_{k=1}^{\infty} P[A_k].
$$

(This is the countable additivity property of the probability measure.) 


Distribution Measure 


In keeping with the notation in the main text, we replace $\omega$ with $\zeta$ to denote the elements of 
$\Omega$. Recall that this was done to save $\omega$ for the Fourier transform variable needed throughout 
the text. Let $B \in \mathcal{B}$, the Borel σ-field of intervals on the real line. Consider a (probability) 
measure $\mu$ on $(R^1, \mathcal{B})$ defined by $\mu[B] \triangleq P[\zeta : X(\zeta) \in B] = P_X[B]$. This measure is called 
the distribution or law of a random variable. The distribution function of X is defined by 

$$
F_X(x) \triangleq \mu(-\infty, x] = P[X \le x],
$$

where $P[X \le x]$ is short for $P[\zeta : X(\zeta) \le x]$. By the continuity from above part of the 
continuity of measure theorem, $F_X(x)$ is continuous from the right. 
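
To make the distribution measure concrete, the sketch below builds the law of a fair six-sided die, an assumed example not taken from the text, and evaluates its distribution function just below, at, and just above a jump. The names `law` and `cdf` are illustrative.

```python
from fractions import Fraction

# Hypothetical random variable: the face shown by a fair die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def law(event):
    """Distribution measure of a set B, given as a predicate on real numbers."""
    return sum((p for x, p in pmf.items() if event(x)), Fraction(0))

def cdf(x):
    """Distribution function: the law of the half-line (-infinity, x]."""
    return law(lambda value: value <= x)

eps = 1e-9
print(cdf(3 - eps), cdf(3), cdf(3 + eps))
```

The value at the jump point agrees with the limit from the right and not with the limit from the left, which is the right continuity stated above.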

Since the field of events is a o-field, and the distribution function is generated by a 
measure, all of the properties of measures apply in probability. It is for this reason that 
probability and measure theories are so closely related. However, to look at probability 
theory just from the point of view of measure theory is to ignore its rich calculus which 
enables the solution of engineering, scientific, and statistical problems. 


APPENDIX E 
Sampled Analog Waveforms and Discrete-time Signals 


Discrete-time signals are often realized by sampling continuous-time analog waveforms. 
Here, we briefly review the relationship between the two types of signals. The reconstruction 
of a continuous-time signal from its equally spaced samples is governed by the famous 
Whittaker-Nyquist-Shannon sampling theorem, which states the following. 


Theorem E.1-1 A continuous signal x(t) with real frequencies no higher than $\nu_{\max}$ 
can be reconstructed exactly from its samples x(nT) if the sampling interval T satisfies 
$T < \frac{1}{2\nu_{\max}}$. ■ 

The proof of this important theorem is given in many places, for example, Principles of 
Communication Engineering by John M. Wozencraft and Irwin M. Jacobs, John Wiley and 
Sons, NY, 1965. Let x(t), y(t), and h(t) denote the input signal, output signal, and impulse 
response of a linear, shift-invariant (LSI) system, respectively. Let B, in Hertz, denote a 
bandwidth that is greater than any signal or system bandwidth encountered in the system 
and let $\Delta \triangleq 1/(2B)$. For ease of notation define 

$$
\operatorname{sinc}(x) \triangleq \frac{\sin \pi x}{\pi x}.
$$

The relationship between input and output for an LSI system is 

$$
y(t) = \int_{-\infty}^{\infty} h(s)\,x(t - s)\,ds
$$
and from the sampling theorem: 

$$
\begin{aligned}
y(t) &= \sum_{l} y(l\Delta)\,\operatorname{sinc}(2B[t - l\Delta]), \\
x(t) &= \sum_{l} x(l\Delta)\,\operatorname{sinc}(2B[t - l\Delta]), \\
h(t) &= \sum_{l} h(l\Delta)\,\operatorname{sinc}(2B[t - l\Delta]).
\end{aligned}
$$

If we insert these expansions into the input-output integral, being careful about using 
different subscripts, and evaluate y(t) at $t = l\Delta$, we obtain 

$$
y(l\Delta) = \sum_{n} \sum_{m} h(n\Delta)\,x(m\Delta)\,I(l, m, n),
$$

where 

$$
I(l, m, n) \triangleq \int_{-\infty}^{\infty} \operatorname{sinc}(2B[s - n\Delta])\,\operatorname{sinc}(2B[s - (l - m)\Delta])\,ds = 0,
$$

for all integers l, m, n except when l − m = n, whereupon it assumes the value $\Delta$. Hence, 
we obtain the important result that 

$$
y(l\Delta) = \sum_{n} h(n\Delta)\,x([l - n]\Delta)\,\Delta .
$$

Often the factor $\Delta$ is submerged into $h(n\Delta)$. In a computer the sampled values of 
the functions become mere sequences of numbers as $y(l\Delta) \triangleq y[l]$, $x(l\Delta) \triangleq x[l]$, and 
$h(n\Delta) \triangleq h[n]$. Then, we obtain 

$$
y[l] = \sum_{n} h[n]\,x[l - n],
$$

that we recognize as a discrete convolution. The important fact to remember is that the 
processing of analog signals can be done by operating on their samples and then 
reconstructing an analog waveform by filtering. 
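
The discrete convolution sum can be sketched in a few lines. The example below assumes finite-length sequences indexed from zero and assumes the factor Δ has already been absorbed into h[n]; the two-point averaging filter and the input samples are illustrative values only.

```python
def convolve(h, x):
    """Discrete convolution y[l] = sum over n of h[n] * x[l - n] for finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for n, h_n in enumerate(h):
        for m, x_m in enumerate(x):
            y[n + m] += h_n * x_m   # the product h[n] x[m] contributes to output index l = n + m
    return y

h = [0.5, 0.5]              # hypothetical impulse response samples
x = [1.0, 2.0, 3.0, 4.0]    # hypothetical input samples
y = convolve(h, x)
print(y)   # [0.5, 1.5, 2.5, 3.5, 2.0]
```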

Another point to consider is that the sequence of numbers {x[n]} does not contain 
information about the sampling period. For example, consider the sinusoid $x(t) = A\cos(\omega_a t + \theta)$. 
If we sample at $t = n\Delta$, $n = \ldots, -2, -1, 0, 1, 2, \ldots$, we obtain the samples $x(n\Delta) = 
A\cos(n\Delta\omega_a + \theta) = A\cos(n\omega + \theta) \triangleq x[n]$, where $\omega \triangleq \Delta\omega_a$. The radian "frequency" $\omega$ is 
dimensionless, which is consistent with the dimensionless "time" n. It is well to remember 
that to convert to analog frequencies $\omega_a$ (radians/sec) or $\nu_a$ (Hertz) we must use $\omega_a = \omega/\Delta$ 
or $\nu_a = \nu/\Delta$. For example, the Fourier transform of a sequence of numbers {x[n]} will yield 
a spectrum of sinusoids at normalized frequencies $\omega$ that lie in the interval $[-\pi, \pi]$. If we 
convert to analog radian frequencies, then the spectrum will lie in the interval $[-2\pi B, 2\pi B]$. 
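
Since the normalized frequency is defined as ω = Δω_a with Δ = 1/(2B), the analog frequency is recovered by dividing by Δ. The sketch below applies this conversion; the function name and the 4000 Hz bandwidth are illustrative assumptions.

```python
import math

def analog_frequencies(omega, bandwidth_hz):
    """Convert a normalized radian frequency to analog rad/s and Hz, assuming Delta = 1/(2B)."""
    delta = 1.0 / (2.0 * bandwidth_hz)    # sampling interval in seconds
    omega_a = omega / delta               # analog radian frequency (rad/s)
    nu_a = omega_a / (2.0 * math.pi)      # analog frequency (Hz)
    return omega_a, nu_a

B = 4000.0   # hypothetical bandwidth in Hz
edge_rad, edge_hz = analog_frequencies(math.pi, B)        # edge of the interval [-pi, pi]
mid_rad, mid_hz = analog_frequencies(math.pi / 2.0, B)
print(edge_rad, edge_hz, mid_hz)
```

The band edge ω = π maps to 2πB radians per second, that is, to B Hertz, which agrees with the interval quoted above.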


APPENDIX F 
Independence of Sample Mean and Variance for Normal Random Variables† 


Of all the distributions we encounter in probability and statistics, without doubt, the Normal 
(Gaussian) distribution is of greatest importance. There are a number of reasons for this, 
but first and foremost is the Central Limit Theorem (CLT), which states that under a set 
of reasonable and realistic conditions the sum of a large number of independent random 
variables tends to have a Normal CDF. This property enables us to solve many problems in 
statistics by invoking the CLT when the sample size is large. Readers of Chapters 6 and 7 
will have noticed that we use the CLT to generate results that otherwise would have been 
difficult to obtain. 

There are other reasons why the Normal distribution plays such an important role in 
probability and statistics. One of them is that the univariate Normal pdf has two parameters 
that are algebraically independent, that is, within their range they can have any arbitrary 
values without conflicting with each other. The mean $\mu$ can have any value in $(-\infty, \infty)$ and 
the variance $\sigma^2$ can have any value in $(0, \infty)$. This suggests that we can always design a 
generator of Normal data that will have a specified mean and variance. The same is true 
for the multivariate, that is, multidimensional, Normal distribution. That is, given a mean 
vector and covariance matrix, respectively, $\boldsymbol{\mu}, \mathbf{K}$, we can always design a Normal generator 
whose data will have these parameters. The Normal pdf also enjoys completeness, a property 
of importance in finding a class of optimum estimators called minimum variance, unbiased 
estimators. 

Given the importance of the Normal distribution, the estimation of its parameters 
$\mu$ and $\sigma^2$ is a central problem in statistics. Assume that we make n i.i.d. observations 
on $X: N(\mu, \sigma^2)$. We estimate $\mu$ with $(1/n)\sum_{i=1}^{n} X_i$ (sample mean) and $\sigma^2$ with 
$(1/n)\sum_{i=1}^{n}\left(X_i - (1/n)\sum_{j=1}^{n} X_j\right)^2$ or $(1/(n-1))\sum_{i=1}^{n}\left(X_i - (1/n)\sum_{j=1}^{n} X_j\right)^2$ (sample 
variance). We note that both the sample mean and sample variance use the same data. 
Remarkably, the sample mean and sample variance are statistically independent‡. In 
proving this result we shall use a theorem from probability theory: If the joint 
moment-generating function of two random variables V and W, say $M_{VW}(t_1, t_2)$, factors as $M_V(t_1) 
M_W(t_2)$, then V and W are independent. This result was derived in Example 4.7-1 for 
characteristic functions, i.e., moment-generating functions evaluated at $t = j\omega$. 

†The proof substantially follows that given in [7-1]. 

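The two random variables of interest are the estimators $\hat{\mu}_X$ and $\hat{\sigma}_X^2$. For simplicity and 
to keep the algebra to a minimum, we define 

$$
Y_i \triangleq \frac{X_i - \mu}{\sigma}, \qquad
V \triangleq \frac{1}{n}\left(\sum_{i=1}^{n} Y_i\right)^2, \qquad
W \triangleq \sum_{i=1}^{n}\left(Y_i - \frac{1}{n}\sum_{j=1}^{n} Y_j\right)^2 .
$$

We note in passing that $V: \chi_1^2$ and $W: \chi_{n-1}^2$. Now recall that $M_{VW}(t_1, t_2)$ is given by 

$$
\begin{aligned}
M_{VW}(t_1, t_2) &= E[\exp(t_1 V + t_2 W)] \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{Q}{2}\right) dy_1 \cdots dy_n,
\end{aligned}
$$

where 

$$
\begin{aligned}
Q &\triangleq \sum_{i=1}^{n} y_i^2 - 2t_1\,\frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 - 2t_2 \sum_{i=1}^{n}\left(y_i - \frac{1}{n}\sum_{j=1}^{n} y_j\right)^2 \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}\,y_i y_j = \mathbf{y}^T \mathbf{R}^{-1} \mathbf{y},
\end{aligned}
$$

where $\mathbf{R}$ is a covariance matrix whose inverse $\mathbf{R}^{-1}$ has 
diagonal elements $r_{ii}$ and off-diagonal elements $r_{ij}$, $i \ne j$, where 

$$
r_{ii} = 1 - 2t_2 - \frac{2(t_1 - t_2)}{n}, \quad i = 1, \ldots, n, \quad \text{diagonal terms of } \mathbf{R}^{-1} \qquad \text{(F-1a)}
$$

$$
r_{ij} = -\frac{2(t_1 - t_2)}{n}, \quad i, j = 1, \ldots, n;\ i \ne j, \quad \text{off-diagonal terms of } \mathbf{R}^{-1}. \qquad \text{(F-1b)}
$$

Recalling that the multidimensional Normal pdf is written as 

$$
f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}|\mathbf{R}|^{1/2}} \exp\left(-\frac{1}{2}\,\mathbf{y}^T \mathbf{R}^{-1} \mathbf{y}\right)
$$

‡Independent or independence is meant in a statistical sense. Else we use algebraic or functional 
independence. 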
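and that $\int_{-\infty}^{\infty} f_{\mathbf{Y}}(\mathbf{y})\,d\mathbf{y} = 1$, we conclude that 

$$
\begin{aligned}
M_{VW}(t_1, t_2) &= E[\exp(t_1 V + t_2 W)] \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{Q}{2}\right) dy_1\,dy_2 \cdots dy_n \\
&= |\mathbf{R}|^{1/2}.
\end{aligned}
$$

From matrix theory, it is known that for any n × n matrix with diagonal elements a 
and off-diagonal elements b, the determinant is computed as $(a - b)^{n-1}(a + (n - 1)b)$. 
Applying this to $\mathbf{R}^{-1}$ with $a \triangleq r_{ii}$, $b \triangleq r_{ij}$ (from Equation F-1) gives 
$|\mathbf{R}^{-1}| = (1 - 2t_2)^{n-1}(1 - 2t_1)$, and since $|\mathbf{R}|^{1/2} = |\mathbf{R}^{-1}|^{-1/2}$ we obtain 

$$
\begin{aligned}
M_{VW}(t_1, t_2) &= (1 - 2t_1)^{-1/2}(1 - 2t_2)^{-(n-1)/2}, \quad t_1 < 1/2,\ t_2 < 1/2, \\
&= M_V(t_1) \times M_W(t_2).
\end{aligned}
$$

Hence, from the theorem quoted at the beginning of the discussion, we conclude that 
V and W are independent. Hence $F_{VW}(v, w) = F_V(v)F_W(w)$ and therefore $\hat{\mu}_Y$ and 
$\hat{\sigma}_Y^2$ are independent. It can be shown that if $\hat{\mu}_Y$ and $\hat{\sigma}_Y^2$ are independent then so are $\hat{\mu}_X$ 
and $\hat{\sigma}_X^2$. This important result enables us to select separate confidence intervals for $\hat{\mu}_X$ 
and $\hat{\sigma}_X^2$ without fear of contradiction. The independence of $\hat{\mu}_X$ and $\hat{\sigma}_X^2$ is true only in the 
Normal case. 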
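
A small simulation can make this result plausible, although it is no substitute for the proof. The sketch below estimates the correlation between the sample mean and the sample variance over many repeated samples, once for Normal data and once for exponential data. The sample size, number of trials, seed, and function names are all illustrative assumptions, and a near-zero correlation is only a necessary consequence of independence, not evidence sufficient to establish it.

```python
import math
import random

def sample_mean_and_variance(sample):
    """Sample mean and unbiased (1/(n-1)) sample variance."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, variance

def pearson(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mean_variance_correlation(draw, n=5, trials=20000, seed=0):
    """Correlation between sample mean and sample variance over repeated samples of size n."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(trials):
        m, s2 = sample_mean_and_variance([draw(rng) for _ in range(n)])
        means.append(m)
        variances.append(s2)
    return pearson(means, variances)

r_normal = mean_variance_correlation(lambda rng: rng.gauss(0.0, 1.0))
r_exponential = mean_variance_correlation(lambda rng: rng.expovariate(1.0))
print(r_normal, r_exponential)
```

For the Normal samples the estimated correlation should be close to zero, while for the skewed exponential samples it is clearly positive, in line with the remark that the independence holds only in the Normal case.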


APPENDIX G 
Tables of Cumulative Distribution Functions: the Normal, Student t, Chi-square, and F 


In the following pages we present tables of the CDF of the (1) Normal; (2) Student-t; (3) 
Chi-square; and (4) F, the latter sometimes called the Snedecor F distribution. 

The gamma function $\Gamma(\alpha) \triangleq \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx$, $\alpha > 0$, appears in several of the CDFs 
below. When $\alpha$ is an integer, say, $\alpha = m \ge 1$, then $\Gamma(m) = [m - 1]! = (m - 1) \times (m - 2) \times 
\cdots \times 2 \times 1$. Note 0! = 1. Next to each CDF are a few of its applications. 
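
The factorial identity for the gamma function can be confirmed with the Python standard library. The sketch below also evaluates the well-known value Γ(1/2) = √π, which is relevant when the degrees of freedom are odd; the range of m tested is an arbitrary choice.

```python
import math

# Compare Gamma(m) with (m - 1)! for the first few positive integers.
for m in range(1, 11):
    assert math.isclose(math.gamma(m), math.factorial(m - 1))

gamma_of_one = math.gamma(1)     # equals 0! = 1
gamma_of_five = math.gamma(5)    # equals 4! = 24
gamma_of_half = math.gamma(0.5)  # equals sqrt(pi)
print(gamma_of_one, gamma_of_five, gamma_of_half)
```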

(1) Standard Normal (extensively used in probability and statistics) 

$$
F_{SN}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{y^2}{2}\right) dy
$$

The general univariate Normal CDF is a function of two parameters, the mean $\mu$ and the 
variance $\sigma^2$. 


(2) Student-t (interval estimation, tests on the means of Normal populations, 
$\mu = \mu_0$ versus $\mu \ne \mu_0$) 

$$
F_T(x; m) = K \int_{-\infty}^{x} \left(1 + \frac{y^2}{m}\right)^{-(m+1)/2} dy, \qquad
K \triangleq \frac{\Gamma([m+1]/2)}{\Gamma(m/2)\sqrt{\pi m}}
$$

The Student-t distribution is a function of the parameter m called the degrees of freedom 
(DOF). It is a special case of the F-distribution. 


(3) Chi-square (confidence intervals for variance of Normal populations, 
testing $\sigma^2 = \sigma_0^2$ versus $\sigma^2 \ne \sigma_0^2$, Pearson's goodness-of-fit) 

$$
F_{\chi^2}(x; m) = K' \int_{0}^{x} y^{m/2 - 1} \exp\left(-\frac{y}{2}\right) dy, \qquad
K' \triangleq \frac{1}{2^{m/2}\,\Gamma(m/2)}
$$

The Chi-square CDF is a function of the parameter m called the degrees of freedom (DOF). 


(4) Snedecor F (generalized likelihood ratio, testing $\sigma_1^2 = \sigma_2^2$ versus $\sigma_1^2 \ne \sigma_2^2$) 

$$
F_F(x; m, n) = K'' \int_{0}^{x} y^{m/2 - 1} \left(1 + \frac{my}{n}\right)^{-(m+n)/2} dy, \qquad
K'' \triangleq \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n}\right)^{m/2}
$$

The Snedecor F CDF is a function of two parameters m and n. These are called the degrees 
of freedom (DOF) of the F-distribution. When referring to the DOF, the parameter m is 
quoted first. 
Table 1 Standard Normal CDF 


$F_{SN}(x)$ is the table entry. The units and tenths digits of x give the row, and the hundredths digit of x gives the position in 
the row. 

  x    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09 
 0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359 
 0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753 
 0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141 
 0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517 
 0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879 
 0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224 
 0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549 
 0.7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852 
 0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133 
 0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389 
 1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621 
 1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830 
 1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015 
 1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177 
 1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319 
 1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441 
 1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545 
 1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633 
 1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706 
 1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767 
 2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817 
 2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857 
 2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890 
 2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916 
 2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936 
 2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952 
 2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964 
 2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974 
 2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981 
 2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986 
 3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990 
 3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993 
 3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995 
 3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997 
 3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998 
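
Entries of the standard Normal table can be reproduced from the error function through the standard identity Φ(x) = ½[1 + erf(x/√2)]. The sketch below is an illustration using the Python standard library; the function name and the particular arguments checked are assumptions chosen for the example.

```python
import math

def standard_normal_cdf(x):
    """Standard Normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare a few values with the corresponding table entries, rounded to four places.
for x in (0.0, 1.0, 1.96, 2.33):
    print(x, round(standard_normal_cdf(x), 4))
```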
Table 2. Student-t CDF 


For each $F_T(x; n)$ given across the top of the table, row n then determines the table entry, the 
corresponding value of x. 


F 
n 0.60 0.75 0.90 0.95 0.975 0.99 0.995 0.9995 
1 0.325 1.000 3.078 6.314 12.706 31.821 63.657 636.619 
2 0.289 0.816 1.886 2.920 4.303 6.965 9.925 31.598 
3 0.277 0.765 1.638 2.353 3.182 4.541 5.841 12.924 
4 0.271 0.741 1.533 2.132 2.776 3.747 4.604 8.610 
5 0.267 0.727 1.476 2.015 2.571 3.365 4.032 6.869 
6 0.265 0.718 1.440 1.943 2.447 3.143 3.707 5.959 
7 0.263 0.711 1.415 1.895 2.365 2.998 3.499 5.408 
8 0.262 0.706 1.397 1.860 2.306 2.896 3.355 5.041 
9 0.261 0.703 1.383 1.833 2.262 2.821 3.250 4.781 
10 0.260 0.700 1.372 1.812 2.228 2.764 3.169 4.587 
11 0.260 0.697 1.363 1.796 2.201 2.718 3.106 4.437 
12 0.259 0.695 1.356 1.782 2.179 2.681 3.055 4.318 
13 0.259 0.694 1.350 1.771 2.160 2.650 3.012 4.221 
14 0.258 0.692 1.345 1.761 2.145 2.624 2.977 4.140 
15 0.258 0.691 1.341 1.753 2.131 2.602 2.947 4.073 
16 0.258 0.690 1.337 1.746 2.120 2.583 2.921 4.015 
17 0.257 0.689 1.333 1.740 2.110 2.567 2.898 3.965 
18 0.257 0.688 1.330 1.734 2.101 2.552 2.878 3.922 
19 0.257 0.688 1.328 1.729 2.093 2.539 2.861 3.883 
20 0.257 0.687 1.325 1.725 2.086 2.528 2.845 3.850 
21 0.257 0.686 1.323 1.721 2.080 2.518 2.831 3.819 
22 0.256 0.686 1.321 1.717 2.074 2.508 2.819 3.792 
23 0.256 0.685 1.319 1.714 2.069 2.500 2.807 3.767 
24 0.256 0.685 1.318 1.711 2.064 2.492 2.797 3.745 
25 0.256 0.684 1.316 1.708 2.060 2.485 2.787 3.725 
26 0.256 0.684 1.315 1.706 2.056 2.479 2.779 3.707 
27 0.256 0.684 1.314 1.703 2.052 2.473 2.771 3.690 
28 0.256 0.683 1.313 1.701 2.048 2.467 2.763 3.674 
29 0.256 0.683 1.311 1.699 2.045 2.462 2.756 3.659 
30 0.256 0.683 1.310 1.697 2.042 2.457 2.750 3.646 
40 0.255 0.681 1.303 1.684 2.021 2.423 2.704 3.551 
60 0.254 0.679 1.296 1.671 2.000 2.390 2.660 3.460 
120 0.254 0.677 1.289 1.658 1.980 2.358 2.617 3.373 
∞ 0.253 0.674 1.282 1.645 1.960 2.326 2.576 3.291 


Adapted from W.H. Beyer, Ed., in CRC Handbook of Tables for Probability and Statistics, 2d ed., The 
Chemical Rubber Co., Cleveland, 1968; p. 283. With permission of CRC Press, Inc. 


Table 3. Chi-Square CDF 


For each $F_{\chi^2}(x; n)$ given across the top of the table, row n then determines the table entry, the corresponding value of x. 


n\F    .005      .010     .025     .050    .100   .250   .500   .750   .900   .950   .975   .990   .995 
 1   .0000393  .000157  .000982  .00393  .0158   .102   .455   1.32   2.71   3.84   5.02   6.63   7.88 
 2   .0100     .0201    .0506    .103    .211    .575   1.39   2.77   4.61   5.99   7.38   9.21   10.6 
 3   .0717     .115     .216     .352    .584    1.21   2.37   4.11   6.25   7.81   9.35   11.3   12.8 
 4   .207      .297     .484     .711    1.06    1.92   3.36   5.39   7.78   9.49   11.1   13.3   14.9 
 5   .412      .554     .831     1.15    1.61    2.67   4.35   6.63   9.24   11.1   12.8   15.1   16.7 
 6   .676      .872     1.24     1.64    2.20    3.45   5.35   7.84   10.6   12.6   14.4   16.8   18.5 
 7   .989      1.24     1.69     2.17    2.83    4.25   6.35   9.04   12.0   14.1   16.0   18.5   20.3 
 8   1.34      1.65     2.18     2.73    3.49    5.07   7.34   10.2   13.4   15.5   17.5   20.1   22.0 
 9   1.73      2.09     2.70     3.33    4.17    5.90   8.34   11.4   14.7   16.9   19.0   21.7   23.6 
10   2.16      2.56     3.25     3.94    4.87    6.74   9.34   12.5   16.0   18.3   20.5   23.2   25.2 
11   2.60      3.05     3.82     4.57    5.58    7.58   10.3   13.7   17.3   19.7   21.9   24.7   26.8 
12   3.07      3.57     4.40     5.23    6.30    8.44   11.3   14.8   18.5   21.0   23.3   26.2   28.3 
13   3.57      4.11     5.01     5.89    7.04    9.30   12.3   16.0   19.8   22.4   24.7   27.7   29.8 
14   4.07      4.66     5.63     6.57    7.79    10.2   13.3   17.1   21.1   23.7   26.1   29.1   31.3 
15   4.60      5.23     6.26     7.26    8.55    11.0   14.3   18.2   22.3   25.0   27.5   30.6   32.8 
16   5.14      5.81     6.91     7.96    9.31    11.9   15.3   19.4   23.5   26.3   28.8   32.0   34.3 
17   5.70      6.41     7.56     8.67    10.1    12.8   16.3   20.5   24.8   27.6   30.2   33.4   35.7 
18   6.26      7.01     8.23     9.39    10.9    13.7   17.3   21.6   26.0   28.9   31.5   34.8   37.2 
19   6.84      7.63     8.91     10.1    11.7    14.6   18.3   22.7   27.2   30.1   32.9   36.2   38.6 
20   7.43      8.26     9.59     10.9    12.4    15.5   19.3   23.8   28.4   31.4   34.2   37.6   40.0 
21   8.03      8.90     10.3     11.6    13.2    16.3   20.3   24.9   29.6   32.7   35.5   38.9   41.4 
22   8.64      9.54     11.0     12.3    14.0    17.2   21.3   26.0   30.8   33.9   36.8   40.3   42.8 
23   9.26      10.2     11.7     13.1    14.8    18.1   22.3   27.1   32.0   35.2   38.1   41.6   44.2 
24   9.89      10.9     12.4     13.8    15.7    19.0   23.3   28.2   33.2   36.4   39.4   43.0   45.6 
25   10.5      11.5     13.1     14.6    16.5    19.9   24.3   29.3   34.4   37.7   40.6   44.3   46.9 
26   11.2      12.2     13.8     15.4    17.3    20.8   25.3   30.4   35.6   38.9   41.9   45.6   48.3 
27   11.8      12.9     14.6     16.2    18.1    21.7   26.3   31.5   36.7   40.1   43.2   47.0   49.6 
28   12.5      13.6     15.3     16.9    18.9    22.7   27.3   32.6   37.9   41.3   44.5   48.3   51.0 
29   13.1      14.3     16.0     17.7    19.8    23.6   28.3   33.7   39.1   42.6   45.7   49.6   52.3 
30   13.8      15.0     16.8     18.5    20.6    24.5   29.3   34.8   40.3   43.8   47.0   50.9   53.7 


*This table is abridged from “Tables of Percentage Points of the Incomplete Beta Function and of the Chi-square Distribution,” 
Biometrika Vol. 32 (1941). It is here published with the kind permission of the author, Catherine M. Thompson, and the editor of 
Biometrika. 
Table 4 Cumulative F CDF 


For each n in the second column on the left and each m in the uppermost row, the entry in the table furnishes the argument needed to 
yield $F_F(x; m, n)$ in the column at the extreme left. 


  G    n\m      1       2       3       4       5       6       7       8       9      10      12      15      20      30      60     120       ∞ 
 .90    1     39.9    49.5    53.6    55.8    57.2    58.2    58.9    59.4    59.9    60.2    60.7    61.2    61.7    62.3    62.8    63.1    63.3 
 .95    1      161     200     216     225     230     234     237     239     241     242     244     246     248     250     252     253     254 
 .975   1      648     800     864     900     922     937     948     957     963     969     977     985     993    1000    1010    1010    1020 
 .99    1    4,050   5,000   5,400   5,620   5,760   5,860   5,930   5,980   6,020   6,060   6,110   6,160   6,210   6,260   6,310   6,340   6,370 
 .995   1   16,200  20,000  21,600  22,500  23,100  23,400  23,700  23,900  24,100  24,200  24,400  24,600  24,800  25,000  25,200  25,400  25,500 
 .90    2     8.53    9.00    9.16    9.24    9.29    9.33    9.35    9.37    9.38    9.39    9.41    9.42    9.44    9.46    9.47    9.48    9.49 
 .95    2     18.5    19.0    19.2    19.2    19.3    19.3    19.4    19.4    19.4    19.4    19.4    19.4    19.5    19.5    19.5    19.5    19.5 
 .975   2     38.5    39.0    39.2    39.2    39.3    39.3    39.4    39.4    39.4    39.4    39.4    39.4    39.4    39.5    39.5    39.5    39.5 
 .99    2     98.5    99.0    99.2    99.2    99.3    99.3    99.4    99.4    99.4    99.4    99.4    99.4    99.4    99.5    99.5    99.5    99.5 
 .995   2      199     199     199     199     199     199     199     199     199     199     199     199     199     199     199     199     199 
 .90    3     5.54    5.46    5.39    5.34    5.31    5.28    5.27    5.25    5.24    5.23    5.22    5.20    5.18    5.17    5.15    5.14    5.13 
 .95    3     10.1    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79    8.74    8.70    8.66    8.62    8.57    8.55    8.53 
 .975   3     17.4    16.0    15.4    15.1    14.9    14.7    14.6    14.5    14.5    14.4    14.3    14.3    14.2    14.1    14.0    13.9    13.9 
 .99    3     34.1    30.8    29.5    28.7    28.2    27.9    27.7    27.5    27.3    27.2    27.1    26.9    26.7    26.5    26.3    26.2    26.1 
 .995   3     55.6    49.8    47.5    46.2    45.4    44.8    44.4    44.1    43.9    43.7    43.4    43.1    42.8    42.5    42.1    42.0    41.8 
 .90    4     4.54    4.32    4.19    4.11    4.05    4.01    3.98    3.95    3.93    3.92    3.90    3.87    3.84    3.82    3.79    3.78    3.76 
 .95    4     7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96    5.91    5.86    5.80    5.75    5.69    5.66    5.63 
 .975   4     12.2    10.6    9.98    9.60    9.36    9.20    9.07    8.98    8.90    8.84    8.75    8.66    8.56    8.46    8.36    8.31    8.26 
 .99    4     21.2    18.0    16.7    16.0    15.5    15.2    15.0    14.8    14.7    14.5    14.4    14.2    14.0    13.8    13.7    13.6    13.5 
 .995   4     31.3    26.3    24.3    23.2    22.5    22.0    21.6    21.4    21.1    21.0    20.7    20.4    20.2    19.9    19.6    19.5    19.3 
 .90    5     4.06    3.78    3.62    3.52    3.45    3.40    3.37    3.34    3.32    3.30    3.27    3.24    3.21    3.17    3.14    3.12    3.11 
 .95    5     6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74    4.68    4.62    4.56    4.50    4.43    4.40    4.37 
 .975   5     10.0    8.43    7.76    7.39    7.15    6.98    6.85    6.76    6.68    6.62    6.52    6.43    6.33    6.23    6.12    6.07    6.02 
 .99    5     16.3    13.3    12.1    11.4    11.0    10.7    10.5    10.3    10.2    10.1    9.89    9.72    9.55    9.38    9.20    9.11    9.02 
 .995   5     22.8    18.3    16.5    15.6    14.9    14.5    14.2    14.0    13.8    13.6    13.4    13.1    12.9    12.7    12.4    12.3    12.1 
 .90    6     3.78    3.46    3.29    3.18    3.11    3.05    3.01    2.98    2.96    2.94    2.90    2.87    2.84    2.80    2.76    2.74    2.72 
 .95    6     5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06    4.00    3.94    3.87    3.81    3.74    3.70    3.67 
 .975   6     8.81    7.26    6.60    6.23    5.99    5.82    5.70    5.60    5.52    5.46    5.37    5.27    5.17    5.07    4.96    4.90    4.85 
 .99    6     13.7    10.9    9.78    9.15    8.75    8.47    8.26    8.10    7.98    7.87    7.72    7.56    7.40    7.23    7.06    6.97    6.88 
 .995   6     18.6    14.5    12.9    12.0    11.5    11.1    10.8    10.6    10.4    10.2    10.0    9.81    9.59    9.36    9.12    9.00    8.88 
  G    n\m     10      12      15      20      30      60     120       ∞ 
 .90    7     2.70    2.67    2.63    2.59    2.56    2.51    2.49    2.47 
 .95    7     3.64    3.57    3.51    3.44    3.38    3.30    3.27    3.23 
 .975   7     4.76    4.67    4.57    4.47    4.36    4.25    4.20    4.14 
 .99    7     6.62    6.47    6.31    6.16    5.99    5.82    5.74    5.65 
 .995   7     8.38    8.18    7.97    7.75    7.53    7.31    7.19    7.08 
 .90    8     2.54    2.50    2.46    2.42    2.38    2.34    2.31    2.29 
 .95    8     3.35    3.28    3.22    3.15    3.08    3.01    2.97    2.93 
 .975   8     4.30    4.20    4.10    4.00    3.89    3.78    3.73    3.67 
 .99    8     5.81    5.67    5.52    5.36    5.20    5.03    4.95    4.86 
 .995   8     7.21    7.01    6.81    6.61    6.40    6.18    6.06    5.95 
 .90    9     2.42    2.38    2.34    2.30    2.25    2.21    2.18    2.16 
 .95    9     3.14    3.07    3.01    2.94    2.86    2.79    2.75    2.71 
 .975   9     3.96    3.87    3.77    3.67    3.56    3.45    3.39    3.33 
 .99    9     5.26    5.11    4.96    4.81    4.65    4.48    4.40    4.31 
 .995   9     6.42    6.23    6.03    5.83    5.62    5.41    5.30    5.19 
 .90   10     2.32    2.28    2.24    2.20    2.15    2.11    2.08    2.06 
 .95   10     2.98    2.91    2.84    2.77    2.70    2.62    2.58    2.54 
 .975  10     3.72    3.62    3.52    3.42    3.31    3.20    3.14    3.08 
 .99   10     4.85    4.71    4.56    4.41    4.25    4.08    4.00    3.91 
 .995  10     5.85    5.66    5.47    5.27    5.07    4.86    4.75    4.64 
 .90   12     2.19    2.15    2.10    2.06    2.01    1.96    1.93    1.90 
 .95   12     2.75    2.69    2.62    2.54    2.47    2.38    2.34    2.30 
 .975  12     3.37    3.28    3.18    3.07    2.96    2.85    2.79    2.72 
 .99   12     4.30    4.16    4.01    3.86    3.70    3.54    3.45    3.36 
 .995  12     5.09    4.91    4.72    4.53    4.33    4.12    4.01    3.90 
 .90   15     2.06    2.02    1.97    1.92    1.87    1.82    1.79    1.76 
 .95   15     2.54    2.48    2.40    2.33    2.25    2.16    2.11    2.07 
 .975  15     3.06    2.96    2.86    2.76    2.64    2.52    2.46    2.40 
 .99   15     3.80    3.67    3.52    3.37    3.21    3.05    2.96    2.87 
 .995  15     4.42    4.25    4.07    3.88    3.69    3.48    3.37    3.26 

(Continued) 
Table 4 (Continued) 


  G    n\m      1       2       3       4       5       6       7       8       9      10      12      15      20      30      60     120       ∞ 
 .90   20     2.97    2.59    2.38    2.25    2.16    2.09    2.04    2.00    1.96    1.94    1.89    1.84    1.79    1.74    1.68    1.64    1.61 
 .95   20     4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35    2.28    2.20    2.12    2.04    1.95    1.90    1.84 
 .975  20     5.87    4.46    3.86    3.51    3.29    3.13    3.01    2.91    2.84    2.77    2.68    2.57    2.46    2.35    2.22    2.16    2.09 
 .99   20     8.10    5.85    4.94    4.43    4.10    3.87    3.70    3.56    3.46    3.37    3.23    3.09    2.94    2.78    2.61    2.52    2.42 
 .995  20     9.94    6.99    5.82    5.17    4.76    4.47    4.26    4.09    3.96    3.85    3.68    3.50    3.32    3.12    2.92    2.81    2.69 
 .90   30     2.88    2.49    2.28    2.14    2.05    1.98    1.93    1.88    1.85    1.82    1.77    1.72    1.67    1.61    1.54    1.50    1.46 
 .95   30     4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.16    2.09    2.01    1.93    1.84    1.74    1.68    1.62 
 .975  30     5.57    4.18    3.59    3.25    3.03    2.87    2.75    2.65    2.57    2.51    2.41    2.31    2.20    2.07    1.94    1.87    1.79 
 .99   30     7.56    5.39    4.51    4.02    3.70    3.47    3.30    3.17    3.07    2.98    2.84    2.70    2.55    2.39    2.21    2.11    2.01 
 .995  30     9.18    6.35    5.24    4.62    4.23    3.95    3.74    3.58    3.45    3.34    3.18    3.01    2.82    2.63    2.42    2.30    2.18 
 .90   60     2.79    2.39    2.18    2.04    1.95    1.87    1.82    1.77    1.74    1.71    1.66    1.60    1.54    1.48    1.40    1.35    1.29 
 .95   60     4.00    3.15    2.76    2.53    2.37    2.25    2.17    2.10    2.04    1.99    1.92    1.84    1.75    1.65    1.53    1.47    1.39 
 .975  60     5.29    3.93    3.34    3.01    2.79    2.63    2.51    2.41    2.33    2.27    2.17    2.06    1.94    1.82    1.67    1.58    1.48 
 .99   60     7.08    4.98    4.13    3.65    3.34    3.12    2.95    2.82    2.72    2.63    2.50    2.35    2.20    2.03    1.84    1.73    1.60 
 .995  60     8.49    5.80    4.73    4.14    3.76    3.49    3.29    3.13    3.01    2.90    2.74    2.57    2.39    2.19    1.96    1.83    1.69 
 .90  120     2.75    2.35    2.13    1.99    1.90    1.82    1.77    1.72    1.68    1.65    1.60    1.54    1.48    1.41    1.32    1.26    1.19 
 .95  120     3.92    3.07    2.68    2.45    2.29    2.18    2.09    2.02    1.96    1.91    1.83    1.75    1.66    1.55    1.43    1.35    1.25 
 .975 120     5.15    3.80    3.23    2.89    2.67    2.52    2.39    2.30    2.22    2.16    2.05    1.94    1.82    1.69    1.53    1.43    1.31 
 .99  120     6.85    4.79    3.95    3.48    3.17    2.96    2.79    2.66    2.56    2.47    2.34    2.19    2.03    1.86    1.66    1.53    1.38 
 .995 120     8.18    5.54    4.50    3.92    3.55    3.28    3.09    2.93    2.81    2.71    2.54    2.37    2.19    1.98    1.75    1.61    1.43 
 .90    ∞     2.71    2.30    2.08    1.94    1.85    1.77    1.72    1.67    1.63    1.60    1.55    1.49    1.42    1.34    1.24    1.17    1.00 
 .95    ∞     3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88    1.83    1.75    1.67    1.57    1.46    1.32    1.22    1.00 
 .975   ∞     5.02    3.69    3.12    2.79    2.57    2.41    2.29    2.19    2.11    2.05    1.94    1.83    1.71    1.57    1.39    1.27    1.00 
 .99    ∞     6.63    4.61    3.78    3.32    3.02    2.80    2.64    2.51    2.41    2.32    2.18    2.04    1.88    1.70    1.47    1.32    1.00 


*This table is abridged from “Table of percentage points of the inverted beta distribution,” Biometrika, Vol. 33 (1943). It is here 
published with the kind permission of the authors, Maxine Merrington and Catherine M. Thompson, and the editor of Biometrika. 
Index 


A 
a posteriori probability, 36 
a priori probability, 36, 433 
additive noise 
example, 736 
adjacent-sample difference, 
95-96 
adjoint operator, 480 
almost-diagonal covariance 
matrix 
dependent random variables 
example, 313-314 
almost surely continuity, 637 
almost surely convergence 
defined, 515 
alternative derivation 
of Poisson process, 553-554 
alternative hypothesis, 390 
analytic continuation, 494 
analytic signal process, 677 
annealing schedule, 775 
applied probability, 340 
arrival times, 458 
asymmetric Markov chain 
(AMC), 506-507 
asymmetric two-state Markov 
chain example, 506-507 
asymptotically stationary 
autocorrelation (ASA) 
function, 508 
asymptotically WSS, 485 


asymptotic wide-sense 
stationarity, 659 
asynchronous binary signaling 
(ABS) process, 548-550 
autocorrelation functions, 456 
ABS, 550 
RTS, 587 
WSS properties, 580-581 
autocorrelation impulse response 
(AIR), 488, 583 
autocorrelation matrix, 312 
autocovariance function, 457 
autoregression, 472, 503 
autoregressive moving average 
(ARMA), 503 
model, 766 
AR2 spectral estimate, 
770-771 
power spectral density, 767 
random sequence 
example, 769 
sequence, 503 
average power in frequency band 
theorem, 594-595 
average probability, 111 
averaging periodograms, 
763-767 
axiomatic definition 
of probability, 15-20
axiomatic theory, 5, 702 


B 
bandlimited processes, 672-675 
defined, 673 
WSS 
example, 675 
theorem, 673 
bandpass random processes, 
675-678 
decomposition, 677 
Bartlett’s estimate, 767 
Bartlett averaging, 
763-767 
Bayes, Thomas, 35 
Bayes’ formula 
for probability density 
functions, 111 
Bayesian decision theory, 
391-395 
Bayes strategy, 395 
Bayes’ theorem 
proof, 35-38 
Bernoulli PMF, 102 
Bernoulli random sequence, 
447-448, 559 
Bernoulli RV, 344-345, 
366-369 
example, 158 
beta pdf, 99 
binomial law in Bernoulli trials, 
48-57 
binomial coefficient, 40 
Index


binomial counting sequence 
example, 522 
binomial distribution function, 
53 
binomial law 
asymptotic behavior, 57-63 
normal approximation, 63-65 
binomial PMF, 272 
binomial random variables 


sum 
example, 189-190 
variance, 243 
birth-death chain, 508 
birth-death Markov chains, 
567-571 
process, 567 
Boltzmann constant, 45 
Boltzmann law, 45 
Borel field, 14 
Borel function, 218 
Borel subsets, 81, 516 
Bose-Einstein statistics, 45-46 
bounded-input bounded-output 
(BIBO), 476 
black-lung disease 
recognition of, 296-298 
Brownian motion, 560-563 
Bucy, R.S., 725 


C


carrier signal, 559 
Cauchy, Auguste Louis, 175 
Cauchy convergence criterion, 
513-521, 652 
Cauchy pdf, 142 
example of, 223-224 
Cauchy probability law, 175 
Cauchy-Schwarz inequality, 548, 
648 
Cauchy sequence 
of measurable functions, 515 
causal filter estimate, 708 
causally invertible, 720 
causally linearly equivalent, 726 
causal probability, 36 
causal Wiener filter, 737-738 
CDF, see Cumulative 
distribution 
function (CDF) 
centered Poisson process, 591 
centered process, 546, 574 
centered random sequence, 457 


central limit theorem (CLT), 
272, 276-281, 341, 421, 
449, 561 
example, 280 
central moment 
defined, 242 
certain event, 8 
chain rule of probability, 500 
change detector 
example, 575-576 
see also Edge detector 
Chapman-Kolmogorov equations, 
571-572 
characteristic equation, 473-474, 
509 
characteristic function (CF), 
267-270 
Normal law, 331-332 
proof, 276 
random vectors, 328-331 
Chebyshev, Pafnuti L., 255 
Chebyshev’s inequality, 255-261, 
350, 518 
Chernoff bound, 261, 264-266
Chi-square pdf, 97-98 
with n degrees-of-freedom, 
228 
law 
example, 227-230 
probability density function, 97 
closed 
end points, 80 
intervals, 80 
coin tossing experiment 
example, 750-751 
collection of realizations, 442 
column vector, 314-316 
combinatorics, 38-48 
communications 
examples, 152, 192-193, 29, 36, 
146, 240 
complement, 10 
complete data 
log-likelihood function, 747 
complex random 
sequences, 456 
complex random process, 
675-677 
complex Schwarz inequality, 646 
composite hypotheses, 402-403
F-test, 412-415 
generalized likelihood ratio 
test (GLRT), 403-408 


test for equality of means of 
two populations, 408-412 
variance of normal population, 
416-415 
compound Markov models, 
779-780 
computerized tomography, 251 
paradigm, 252 
conditional densities, 134-135 
conditional distributions, 
107-137 
functions, 110, 298 
conditional expectations, 
232-241 
properties, 240 
as random variable, 239 
communication system 
example, 240 
conditional failure rate, 138 
conditional mean (see also 
conditional mean), 241 
MMSE estimator theorem, 703 
conditional CDF, 109, 110 
conditional pdf, 110 
conditional expectation, 234 
linear combination, 298 
conditional probabilities, 20-26 
conditional representation 
example, 778 
confidence interval 
estimation, 363 
mean, 363-364
confidence interval for median, 
428-429 
confidence intervals, 384 
conjugate symmetry property, 
581 
consistency, 356-357 
estimator, 347 
example, 455 
guaranteed, 455 
constant mean function, 465 
continuity 
probability measure, 452-454 
continuous entropy, 770 
continuous operator, 592 
continuous random variable, 
100-103 
continuous sample space 
random process, 545 
continuous system, 592 
continuous-time linear systems 
random inputs, 572-578 


continuous-time linear system 
theory, 471 
continuous-time Markov chain, 
504 
continuous-valued Markov 
process, 564 
continuous-valued Markov 
random 
sequences, 500-511 
continuous-valued random 
sequence, 456 
contours of constant 
density, 256 
joint Gaussian pdf, 253-255 
convergence 
of deterministic sequences, 514 
of functions, 514 
in probability, 515-521 
of random sequences 
example, 516-517 
for random sequences 
Venn diagram showing, 520 
convergent sequences 
example, 514 
convolution 
integral, 178 
theorem, 476 
two rectangular pulses 
illustration, 664 
convolution-type problems 
example, 178-180 
coordinate transformation in 
Normal case, 255 
correlated noise, 449 
example, 448-450 
correlated samples, 325 
correlation coefficient, 
134, 247 
calculating, 480 
coefficient estimate, 361 
correlation matrix, 312 
correlation window, 762 
correlation function, 464, 482, 
546 
definition of, 456, 464 
example, 488 
properties 
psd table, 586 
random sequence with memory 
example, 467-471 
Cosine transforms, 682 
countable additivity 
axiom, 446 


countable 
random variables, 546 
unions, 13 
countably additive, 447 
covariance, 360-361 
covariance function, 464, 546 
recursive system 
example, 483-486 
covariance matrices, 311-318 
almost-diagonal 
example, 313 
diagonalization, 316 
properties, 314-319 
whitening transformation, 
318-319 
cross-correlation function, 574 
example of, 485 
theorem, 480-481 
WSS properties, 580 
cross-covariance differential 
equation, 657 
cross-power spectral density, 490, 
592 
cumulative distribution function 
(CDF), 80, 104 
computation of F_X(x), 85-88
conditional, 108, 110 
defined, 83 
joint, 118-123, 124 
properties of CDF F_X(x),
84-85 
Tables of, 98, 104 
random vectors, 296 
random sequence, 454 
transformation of 
example, 160 
unconditional, 109, 298 
cyclostationary, 497 
processes, 600-605 
waveforms, 497 


D 
Davenport, Wilbur, 218 
decimation, 496-497 
example, 496 
decision function, 392 
deconvolution, 488 
decorrelation of random vectors 
example, 298-299 
decreasing sequence, 453 
degrees of freedom (DOF), 353 
deleted vector, 774 
De Moivre, Abraham, 277 


De Morgan, Augustus, 12 
De Morgan’s laws, 12 
densities of RVs, 107-137 
computation by 
induction, 459 
dependent random variables 
almost-diagonal covariance 
matrix 
example, 313-314 
derivative 
of quadratic forms, 383 
of scalar product, 382-383 
of Wiener process 
example, 640 
of WSS process 
example, 584 
detection theory 
application of K-L expansion 
example, 671 
deterministic autocorrelation 
operator, 762 
deterministic sequences 
convergence, 513 
deterministic vectors, 296 
deviation from the mean for a 
Normal RV 
example, 257 
diagonal dominance, 548 
diagonalization 
of covariance matrices, 316 
simultaneous of two matrices, 
319-264 
see also Whitening 
difference 
of two sets, 10 
differential equations, 
596-600 
example of, 472-473 
solution of, 472-473 
digital modulation, 548 
PSK, 558 
Dirac, Paul A. M., 101 
Dirac delta functions, 101 
direct dependence, 449, 599 
concept, 502 
direct method for pdf’s, 166 
discrete convolution 
of PMFs, 271-272 
discrete random vector, 102-103, 
328-331 
discrete-time Fourier transform, 
489 
discrete-time impulse, 460 
discrete-time linear systems

principles, 471-474 

shift invariant, 477 

discrete-time Markov chains, 504 

defined, 504 

discrete-time signal, 475 

discrete-time simulation, 493 

and synthesis of sequences, 
493-496 

discrete-time systems review, 471 

discrete-valued Markov random 

sequence, 500-503, 564 

discrete random variables, 

100-101 

distance preservation, 316 

see also Unitary 

distribution-free estimation, 384 

distribution-free hypothesis 

testing, 429-432 

distribution-free/nonparametric 

statistics, 372 

distribution function, 83-88 

doubly stochastic, 113 

driven solution, 611 

Durant, John, 7 

dynamic programming 

principle, 757 


E 
edge detector 
example of, 482-483, 575-576 
input correlation function, 483 
using impulse response, 
487-488 
Eigenfunctions, 476 
Eigenvalues, 314-316 
Eigenvector, 314-316, 318
matrix, 316-317 
electric-circuit theory 
example, 151-152 
elementary events, 17 
Emission tomography (ET) 
application of E-M to, 744-747 
energy function, 776 
energy norm, 517 
E linear operator, 709 
ergodic in correlation 
in WSS random process 
defined, 662 
ergodic in distribution 
function, 664 
random process, 665 
ergodic in power, 664 


ergodic in the mean 
example, 665 
in WSS random process 
defined, 644 
theorem, 662 
ergodicity, 659-666 
covariance functions, 666 
Erlang pdf, 459 
error function, 91-92
error probability, 398 
estimation, 260, 346 
consistent, 347 
of covariance and means, 
376-380 
expectation and introduction, 
215-292 
minimum-variance unbiased, 
347 
MMSE, 347 
multidimensional distribution, 
302 
observation vector, 347 
of signal in noise 
example, 711 
vector means, 376-380
estimators, 260, 264, 340, 346, 
348 
maximum likelihood, 365 
parametric, 372 
linear, 692 
minimum MSE (MMSE), 703 
linear MMSE (LMMSE), 711
Kalman, 730 
Wiener, 734 
maximum-likelihood (ML), 739 
maximum a posteriori (MAP), 
755 
spectral (psd), 760 
simulated annealing, 773 
Euclidean distance, 316 
Euclidean sample spaces, 14 
Euler’s summation formula, 33 
event probabilities 
normal approximation, 64 
events, 8-14 
exclusive-or 
of two sets, 10 
expectation, 215 
operator, 224 
linearity of, 243 
of a RV, 215 
of a random vector, 311-313
of a discrete RV, 217


expectation-maximization (E-M) 
algorithm, 333, 739-741 
and exponential probability 
functions, 744-745 
expected value, (see moments) 
Tables of, 246 
exponential autocorrelation 
function, 587 
exponential pdf, 95 
exponential probability 
functions 
in E-M algorithm, 744-745
exponential RV, 191 


F 
failure rates, 137-141 
failure time, 565 
feedback filter, 449 
Feller, William, 38, 218 
Fermi-Dirac statistics, 46 
fields, 8, 13 
filtered-convolution 
back-projection, 252 
filtering 
of independent 
sequences, 449 
finite additivity, 446 
finite capacity buffer 
example, 570-571 
finite energy norm, 517 
finite state space, 504 
finite-state Markov chain, 504 
first-order signal in white noise 
example, 738 
Fisher, Ronald Aylmer, 99 
force of mortality, see 
Conditional failure rate 
forward-backward procedure, 
755-757 
Fourier series 
expansion, 682 
WSS processes, 681-683 
Fourier transform, 113, 266-267, 
471, 475 
WSS, 761-763 
frequency function, 101 
frequency of occurrence 
measure, 4-5
F-test, 412-415 
function-of-a-random-variable 
(FRV) problems, 151-153 
functions 
of random variables, 151-205
G
gamma pdf, 98 (see also Erlang
pdf)
Gibbs sampler, 774-775
algorithm, 775
ranking test for sameness of
two populations, 432-433
GLRT, see Generalized


Gauss, Carl F., 89 
Gaussian (see Normal) 
data in LMMSE, 709 
pdf, 89 
random vector, 319-328 
noise, 449 
density, 89 
characteristic function 
conversion, 268 
standard Normal, 91-95
joint Gaussian, 253 
marginal, 252 
Gaussian law, 302 
Gaussian random process 
defined, 562 
Gaussian random sequence, 460, 
478
predicting, 723-724 
Gaussian random 
vector, 460 
Gauss-Markov models 
causal, 775 
noncausal, 775-779 
Gauss-Markov vector random 
process, 611 
Gauss Markov random sequence 
example, 501 
noncausal 
example, 777 
theorem, 723-724 
generalized eigenvalue, 319 
equations, 320 
generalized likelihood ratio test 
(GLRT), 403-408, 433 
generalized m.s. derivative 
conventional process, 643-645 
generalized random 
process, 643 
generator 
Markov chain, 567 
generator matrix 
Markov chain, 569 
generic linear system 
system diagram, 471 
generic two-channel LSI system, 
607 
geometric series, 45, A-3 
geometric RV, 232 
Gemans’ line sequence, 780 
Gibbs line sequence, 780-784 


likelihood ratio test 
(GLRT) 

goodness of fit, 417 

Gosset, W. S., 99

growing memory estimate, 713, 
17 


H 
half-open 
end points, 80 
intervals, 80 
half-wave rectifier 
example, 158-159 
Hammersley-Clifford theorem, 776
hard clipper, 557 
hazard rate, see conditional 
failure rate 
Helstrom, Carl, 225 
Hermitian kernel, 684 
Hermitian matrices, 312 
Hermitian semidefinite, 683, 684 
Hermitian symmetry, 457, 548 
hidden Markov models (HMM), 
750-760 
specification of, 752-754 
Hilbert linear-space concept 
random variable, 719 
Hilbert space, 647 
Hilbert transform operator, 676 
homogeneous equation, 472-473 
hypothesis testing, 390-391, 433 
Bayesian decision theory, 
391-395 
composite hypotheses, 402-403
F-test, 412-415 
generalized likelihood ratio 
test (GLRT), 403-408 
test for equality of means of 
two populations, 408-412 
variance of normal 
population, 416-415 
goodness of fit, 417-423
likelihood ratio test, 396-402 
ordering, percentiles, and rank, 
423-428 
confidence interval for 
median, 428-429 
distribution-free hypothesis 
testing, 429-432 


I 
impossible event, 10 
impulse, 101 
(see also discrete-time 
impulse) 
function, 101 
response, 475 
increasing sequence, 452-453 
theorem, 452 
independence, 20-26 
definitions of, 21-22
independent and identically 
distributed (i.i.d.), 162, 
174, 186, 189, 341 
and LLN, 260 
and CLT, 277 
sum of i.i.d. binomial RVs, 271 
independent increments, 552-553 
property 
defined, 463 
random sequence, 464, 523 
independent random sequence, 
444 
independent random process, 579 
independent random variables, 
125-126 
sum of, 177-182 
independent random 
vectors, 312 
indicator process, 664 
indirect dependence, 599 
induction, 11 
infinite intersections, 445 
infinite length Bernoulli trials, 
447-448 
infinite length queues 
birth-death process, 568 
infinite length random sequences, 
445 
infinite root transmittance 
example of, 170-171 
infinitesimal parallelepiped, 
300-301 
infinitesimal parallelogram, 197 
infinitesimal rectangle, 197 
infinitesimal rectangular 
parallelepiped, 300-301 
infinitesimal volume 
ratio of, 300 
initial rest condition, 474 
inner products, 259
for random variables, 702 
innovations decomposition 
LMMSE predictor, 721 
innovation sequences, 719-734 
defined, 720 
theorem, 722 
in-phase component of signal, 
676 
input correlation function 
edge detector, 483 
input moment functions, 478 
instantaneous failure rate, see 
Conditional failure rate 
integral equations, 683-687 
integral of white noise 
example of, 652 
intensity, see mean-arrival rate 
intensity rate, see conditional 
failure rate 
interarrival times, 458 
interpolation, 497-500
example of, 496 
interpolative representation 
example of, 778 
interpretation 
of psd, 490, 586-587 
intersection of sets, 10 
intuition, 2-3 
intuitive probability, 3 
invariance property 
of MLE, 369 
inverse Fourier transform, 113, 
273, 475, 532, 584 
inverse image, 81 
inverse two-sided Laplace 
transform, 598, A-3 


J 
Jacobian, 322, 586 
computation, 302 
magnitude, 309 
transformation, 198 
joint characteristic functions, 
273-276 
example, 274-275 
joint densities 
of random variables, 116-134 
of random vectors, 295-298 
joint distribution, 107-137 
of random vectors, 295-298 
joint Gaussian density 
graph of, 132 


joint Gaussian distribution, 252 
MMSE estimate 
theorem, 705-706 
zero means, 705-706 
joint Gaussian pdf, 132, 250 
contour of constant density, 
253-255 
joint Gaussian random variables, 
251-253 
joint independent random 
sequences, 725 
joint moments, 246-248 
defined, 246-247
joint PMF, 328-331 
defined 
and conditional expectation, 
234-235 
joint probability 
of events, 20-35 
joint stationary random 
processes, 581-600


K 
Kalman, R.E., 725 
Kalman-Bucy prediction filter, 
725 
Kalman filter, 503, 511-512, 
719-734 
equation, 729 
formula, 729 
system diagram of, 730 
Kalman gain matrix, 728 
Kalman predictor, 725-730 
system diagram of, 728 
Kalman smoothers, 730 
Karhunen-Loeve (K-L) 
expansion, 666-672 
theorem, 666 
white noise example, 669 
Wiener process example, 670 
Karhunen-Loeve (K-L) 
transform, 682 
kernel 
of integral equation, 683 
Kolmogorov, Andrei, 2, 14, 446, 
448 
Kronecker delta function, 667 
see also discrete-time impulse 


L 
Lagrange method, 245 
for spectral estimation, 771 


Laplace pdf, 96 
Laplace transform, 597 
law of large numbers, 
259-260 
in statistics, 371 
weak laws, 521-522 
strong law, 525 
convergence, 521-526 
Lebesgue measure, 516 
left-out data, 740
left-to-right model of HMM
example, 754-755
Levinson-Durbin algorithm, 714
likelihood function, 366
likelihood ratio test, 396-402
likelihood ratio test (LRT), 433
limit
various types of, 636
linear amplifier with cutoff
example of, 169-171
linear combination
of conditional pdf, 298
linear constant coefficient
differential equation
(LCCDE), 472, 596, 654,
719
example, 511, 599
linear continuous-time system
defined, 573
linear differential equations
(LDEs) random processes,
555
linear estimation, 347
minimizing the MSE, 709
of vector parameters, 380-384
linearity
expectation operator, 548
linear minimum mean-square
error (LMMSE), 709-710
filter, 725
one-step predictor, 716
linear operator, 471 
linear prediction 
example of, 249-250 
first-order WSS Markov 
process example, 714 
linear predictive coding (LPC), 
754-755 
linear regression 
example, 249-250 
linear shift-invariant (LSI), 
474-475 
systems, 581-600 


linear systems 
application example, 653-654 
with input random sequence, 
477-478
WSS inputs 
input/output relations, 594 
linear time-invariant (LTI), 474 
see also Linear shift-invariant 
(LSI) 
linear transformation 
log-likelihood, 741-743 
line process, 779 
line sequence, 779 
log-likelihood 
function, 366 
of complete data, 747 
of linear transformation, 
741-743 
loss functions, 391 
lowpass filter 
example, 480 


M 
marginal density, 298 
marginal pdf, 311, 330 
defined, 225 
random vector, 298 
Markov, A. A., 471 
Markov chain, 504-511, 564 
asymmetric two-state 
example of, 506-507 
birth-death, 567-569 
continuous-time, 504 
discrete-time, 504 
defined, 504 
finite-state, 504 
generator matrix, 569 
Markov inequality, 257-258 
Markov models 
compound, 779 
hidden (HMM), 750-760 
Markov sequences of order-2 
wide-sense, 723 
Markov-p random sequence, 
502-503 
defined, 502, 504 
scalar, 510-511 
example, 513 
Markov process 
continuous-valued, 564 
linear prediction for a 
first-order WSS 
example, 714 


Markov random process, 
563-567 
defined, 564 
vector 
defined, 610 
Markov random sequence, 471, 
500-511 
continuous-valued, 500-511 
discrete-valued, 500-511, 564 
Markov state diagram 
for birth-death process, 568 
Markov vector random sequence, 
512-513 
Markov zero-mean sequence 
wide-sense 
theorem, 722 
Martingale, 522 
Martingale convergence theorem, 
524-526, 708 
Martingale sequence 
theorem, 523-524 
matched filter, 672 
MATLAB 
computation, 132-132 
computing average number of 
calls 
example, 230-232 
computing the probability 
example, 229-230 
methods 
simulation, 450-451 
periodogram and Bartlett’s 
estimator, 765-766 
psd, 588 
example, 491-492
maximum a posteriori 
probability (MAP), 773 
maximum entropy (ME) 
principle 
example of, 244-246 
maximum entropy spectral 
density, 770 
maximum likelihood (ML) 
principle, 365-366 
maximum-likelihood estimator 
(MLE), 365-369, 384, 402, 
739-749 
max operator, see Supremum 
operator 
Maxwell-Boltzmann statistics, 45 
mean and variance, simultaneous 
estimation of, 361-363 
mean-arrival rate, 552 


mean confidence interval for, 
352-354 
mean-estimator function (MEF), 
348, 349-352, 365 
mean function 
of random sequences, 546 
mean-square calculus, 636-651 
defined, 639 
theorem, 639 
mean-square continuous 
defined, 637 
example, 637-639 
theorem, 637 
mean-square convergence 
defined, 517 
mean-square derivative 
of mean and correlation 
functions, 641 
example, 643 
theorem, 643 
of WSS random process, 644 
mean-square description 
system operation, 478 
mean-square ergodic, 660 
mean-square error (MSE), 241, 
249 
minimum MSE (MMSE), 347, 
703 
linear MMSE (LMMSE), 695
mean-square integral 
defined, 651 
related to Wiener process, 653 
mean-square periodic, 600, 678 
mean-square stochastic 
differential equations, 
654-659 
mean-square stochastic integrals, 
574, 651-654 
mean values, 
Tables of, 246 
measurable function, 515 
measure theory, 446 
memoryless property 
of exponential pdf, 552 
Mercer’s theorem, 669, 684, 687 
Merzbacher, Eugen, 1-2 
minimum cost path, 757 
minimum mean-square error 
(MMSE), 347, 702
and the conditional mean 
complex random vectors, 
707-708 
estimator, 347-348 
minimum mean-square error
(MMSE) (continued) 
and the conditional mean 
theorem, 703 
orthogonality condition, 
705-706 
vector, 704 
minimum-variance unbiased 
estimator, 347 
miscalculations 
in probability, 7-8 
misuses in probability, 7-8 
mixed random sequence, 456 
mixed random variables, 100-107 
mixture distribution function, 
298 
mixture pdf, 298 
modified trellis diagram, 507 
moments, 215, 242-255, 302 
estimator, 263 
moment generating function 
(MGF), 261 
of random sequence, 456, 478 
Tables of, 246 
Monte Carlo simulation, 280
moving average, 289, 472, 503 
E-M algorithm, 739 
summary, 743 
example, 746-749 
multidimensional Gaussian law, 
303, 319-328 
multidimensional Gaussian pdf, 
313 
multinomial Bernoulli trials, 
48-57 
multinomial coefficient, 41 
multinomial formula, 54-57
exercises dealing with, 72 
multiple-parameter ML 
estimation, 367 
multiple transformation 
of random variables, 299-302 
multiplier (Product of RVs) 
example, 173-174 
multiprocessor reliability 
example, 565 


N 
Neyman, J., 99 
Neyman-Pearson theorem
(NPT), 401-402 
noise 
atmospheric, 5 


communication channel, 29 
correlated, 448 
Gaussian noise, 153 
narrow-band, 192 
noise voltage, 90 
resistor noise, 142 
white noise, 382, 494 
noncausal Gauss-Markov models, 
775-779 
random sequence example, 
777-778 
non-Gaussian parameters, 
363-365 
nonindependent random 
variables 
joint densities of, 129-130 
nonlinear devices 
example, 169-170 
nonmeasurable subsets, 80 
nonnegative random variables, 
257 
nonnegative RV, 257 
nonstationary first-order Erlang 
density, 551, (see also 459) 
nonparametric statistics, 425 
Normal approximation, 376, 
428-429 
to binomial law, 63-65 
to event probabilities, 64 
to Poisson law, 65 
see also Gaussian 
Normal law, 63 
normal equations for linear 
prediction 
example, 713-714 
normalized covariance, 237, 313, 
361 
normalized frequency, 476 
Normal 
characteristic function of, 
331-332 
Normal (Gaussian) pdf, 89 
Normal random vector, 322-325 
NPT, see Neyman-Pearson
theorem (NPT) 
numerical average, 348 


O

observation noise, 725 

observation vector 
estimator, 347 

occupancy numbers, 43 

occupancy problems, 42 


open sets 
and end points, 80 
intervals, 80 
operator E
properties of, 717-719
theorem, 717-718 
operator L, 472 
linear, 471 
optimal linear 
LMMSE, 709 
optimum linear interpolation 
example, 714-715 
optimum linear prediction 
example, 249-250 
ordered random variables, 
302-305 
distribution of area random 
variables, 305-311 
ordered sample, 39 
ordering 
subpopulation, 39 
ordering, percentiles, and rank, 
423-428 
confidence interval for median, 
428-429 
distribution-free hypothesis 
testing, 429-432 
ranking test for sameness of 
two populations, 432-433 
orthogonal 
random processes, 578 
random vector, 312 
orthogonality condition 
MMSE estimate, 705-706 
orthogonality equations, 706 
orthogonal projection operator, 
see Operator E 
orthogonal random vector, 312 
orthogonal unit eigenvectors, 317 
orthonormal eigenvectors 
computation, 325 
outcomes, 3-4 
output autocorrelation function, 
482 
WSS, 582-583 
output-correlation functions 
theorem, 480-482 
output covariance 
calculating, 480 
output moment functions, 478 
output random sequence mean 
theorem, 478-480 
P
packet switching
example, 570 
Papoulis, Athanasios, 155 
paradoxes 
in probability, 7-8 
parallelepipeds 
union and intersection, 297 
parallel operation (maximum 
operation) 
example, 175 
parameter estimation, 340-384 
estimators, 346-348 
independent, identically 
distributed (i.i.d.) 
observations, 341-343 
linear estimation of vector 
parameters, 380-384 
maximum likelihood 
estimators, 365-369 
mean and variance, 
simultaneous estimation 
of, 361-363 
mean, estimation of, 348-349 
6-confidence interval, 352, 
355 
mean-estimator function 
(MEF), 349-352 
normal distribution, 352-354 
median of population versus 
its mean, 371-372 
non-Gaussian parameters from 
large samples, 363-365 
parametric versus 
nonparametric statistics, 
369-371, 372-373 
confidence interval for 
median when n is large, 
375-376 
confidence interval on 
percentile, 373-375 
median of population versus 
its mean, 371-372 
probabilities, estimation of, 
343-346 
variance and covariance, 
355-357 
confidence interval, 357-359 
covariance, estimating, 
360-361 
standard deviation directly, 
estimating, 359-360 
vector means and covariance 


matrices, 376-377 
μ, estimation of, 377-378
covariance K, estimation of, 
378-380 
parametric case, 425 
parametric spectral estimate, 
768-770 
parametric statistics, 372, 425 
particular solution, 473 
Pauli, Wolfgang, 46 
P-convergence, 520 
Pearson, E. S., 99 
Pearson test statistic, 419, 421 
periodic processes, 600-606, 
672-683 
periodogram, 761-763 
estimate 
ARMA model, 766 
Pdf, see Probability density 
function 
phase recovery, 488 
phase-shift keying (PSK), 548 
digital modulation, 558 
phase space, 45 
Planck’s constant, 45 
PMF, see probability mass 
function 
points, 59 
Poisson counting process, 548,
550-554, 638 
Poisson law, 57-63 
random variable, 102, 102-103
exercises dealing with, 73 
compound Poisson, 113 
Poisson process, 550 
alternative derivation, 555-557 
sum of two independent 
example, 554 
Poisson characteristic function, 
331 
sum 
example, 188 
Poisson rate parameter, 60 
Poisson transform, 113, 236 
population, 341, 423 
positive definite, 314, 684-685 
positively correlated, 253 
positive semidefinite, 314-315, 
548, 683 
autocorrelation functions 
property, 486 
correlation function 
theorem, 596 


power spectral density (psd), 
349, 351-353, 443-448 
correlation function properties 
table, 585 
defined, 584 
interpretation, 490, 586-596 
properties, 489 
PSK 
example, 604-606 
stationary random sequences, 
491-492 
transfer function, 593 
triangular autocorrelation 
example, 589 
white noise 
example, 586 
predicted value, 237 
prediction-error covariance 
equation, 732 
prediction-error variance, 733 
prediction-error variance matrix, 
730 
prima facie evidence, 250 
principle of mathematical 
induction, 459 
probability 
axiomatic definition of, 15-20 
continuity, 637 
estimation of, 343-346 
exercises dealing with, 66-77
theory of, 17 
types, 2-6 
probability-1 (almost sure) 
convergence, 515-521 
probability density function 
(pdf), 88-100, 217, 296, 
504, 739-749 
Bayes’ formula, 111
Cauchy RV 
example, 223-224 
Chi-square, 228 
conditional, 110 
conditional expectation, 234 
linear combination, 298 
Erlang RV, 459 
exponential pdf, 95 
Gaussian, 89-90, 256, 
268, 332 
conversion, 91-93
Gaussian marginal, 252 
joint 
conditional expectation, 234 
joint Gaussian, 200 
probability density function
(pdf), (continued) 
contour of constant density, 

254-255 

Laplacian pdf, 96 

marginal, 225, 298, 311, 330 

mixture, 298 

multidimensional Gaussian, 
313 


Normal (Gaussian) RV, 89-90, 


256, 268, 332; see also 
Gaussian 
Rayleigh pdf, 95, 192 
Rice-Nakagami, 192 
Table of, 98 
uniform pdf, 95 
univariate normal, 89, 322 
probability laws, 48-57 
exercises dealing with, 72 
probability mass function 
(PMF), 87, 100, 217, 366, 
504, 739-749 
discrete convolution, 271 
Poisson counting process, 552 
probability measure 
continuity, 452-454 
probability space, 14 


Q 
quadrature component, 676 
quantizing 

in A/D conversion, 161 

example, 161-164

in image compression, 97 
queueing process, 567 
queue length 

finite, 569-571 

infinite, 568-569 


R 
radioactivity monitor 
example, 553-554 
random complex exponential 
example, 580 
random inputs 
continuous-time linear 
systems, 572-578 
random process, 543-612, 
636-683 
applications of, 701-784 
classifications of, 578-580 
defined, 544-548 


exercises dealing with, 
612-634, 683-700, 784-788 
generated 
from random sequences, 572 
white noise 
example, 669 


random pulse sequence 


example, 519 


random sample of size n, 341 
random sequence, 441-526 


applications of, 701-784
exercises dealing with, 
784-788 
concepts, 442-471 
consistency of higher-order 
cdf’s, 455 
convergence of, 513-521 
defined, 442-443 
exercises dealing with, 526-541 
finite support 
example of, 443 
illustration of, 442 
input/output relations, 493 
linear systems and, 477-486 
random process generated, 572 
statistical specification of, 
454-471 
synthesis of, 493-496 
tree diagram of 
example, 444-445 


random telegraph signal (RTS), 


548, 557-558 
autocorrelation function of, 
587 


random variables, 79-141, 515 


definition of, 80-83 

estimation of, 701-719 

exercises dealing with, 141-149 

functions of, 151-205 

input/output view, 154-155 

multiple transformation of, 
299-302 

symbolic representation, 81 


random vectors 


characteristic functions of, 
328-331 

characterized as, 302 

classified as, 312 

estimate of, 702 

expectation vectors and 
covariance matrices, 
311-313

functionally independent, 299 


joint densities, 295-299 
marginal pdf, 298 
random walk problem 
displacement, 173 
random walk sequence 
example, 461-463
ranking test for sameness of two 
populations, 432-433
Rayleigh density function, 192 
Rayleigh distribution, 192 
Rayleigh law, 327 
Rayleigh pdf, 95, 192 
realizations 
of random sequence, 442 
real symmetric, matrices, 312 
real-valued random process 
example of, 576-577 
theorem regarding, 595 
real-valued random variable 
example of, 580 
recursive algorithm, 755-757
region of absolute convergence, 
477 
relative frequency approach, 4 
renewal process, 557 
Rice, S. O., 192 
Rice-Nakagami pdf, 192
Rician density 
example, 192-193 
Riemann, Bernhard, 219 
Riemann sum, 219 
rotational transformer, 194 
running time-average 
example, 504 


S
sample space, 8-9, 296 
space, 296 
sample-function continuity, 637 
sample mean, 348 
sample mean estimator (SME), 
363 
example, 260 
sample sequence, 442 
construction example, 450 
random walk, 462 
sample space, 8 
illustration, 442 
sampling 
distribution, 353 
with replacement, 39 
theory, 499 
without replacement, 39-40 
scalar Kalman filter
example, 732 
scalar Markov-p 
example, 513 
scalar product, 259 
derivative of, 381-383 
scalar random sequence, 511-512
Schwarz, H. Amandus, 258 
Schwarz inequalities, 255-261, 
642, 685 
complex, 646 
second-order joint moments, 246 
semidefinite functions 
defined, 548 
separability of random process, 
578 
separable random sequences, 546 
example, 343 
set algebra, 10 
sets, 8-14 
shift-invariant, 465, 474 
covariance function, 485 
short-time state-transition 
diagram, 566 
sigma algebra, 13 
sigma fields, 13 
signal plus additive noise, 782 
simulated annealing, 773-784 
noisy pulse sequence 
example, 781-784 
simultaneous diagonalization 
two covariance matrices, 319 
sine wave, 164-166 
six-dimensional space, 45 
smoothing, 735-737 
spectral estimation, 760-613 
spectral factorization, 494 
speech processing 
application, 754 
square-law detector 
example, 157-158, 184-187 
standard deviation, 216 
standard Normal density, 63 
see also Gaussian 
standard Normal 
distribution, 63 
see also Gaussian 
standard Wiener process 
example, 637-638 
state conditional output symbol 
probabilities, 750 
state equations, 511-513,
607-612
state of the process, 564
state-transition diagram, 467, 
566 
concept 
example, 504-505 
state-transition matrix, 504 
state-variable representation, 513 
stationary, 184 
of order 2, 665 
processes, 596-600 
random process 
defined, 578-579 
random sequences, 464-466
psd, 492-493 
statistically orthogonal, 718 
statistically specified 
random process, 545 
statistical pattern recognition, 
319 
statistical signal processing 
applications, 701-784 
exercises dealing with, 
784-788 
statistical specification 
of random sequence, 454-471 
of random process, 544 
steady state, 510 
autocorrelation function 
asymptotic stationary 
(ASA), 508 
Stieltjes integral, 456 
Stirling, James, 57 
Stirling’s formula, 57 
stochastic continuity, 636-645 
types, 637 
stochastic derivatives, 636-645 
stochastic differential equations,
654-659
stochastic iterative procedure, 
774 
stochastic processes 
transformation, 572-578 
strict-sense stationary, 663 
Strong Law of Large Numbers, 
516 
theorem, 525 
student-t pdf, 98 
subpopulation, 39-41 
sufficient statistic, 744 
supremum operator, 517 
superposition, summation, 474 
sure convergence
defined, 515
symmetric exponential
correlation function 
RTS, 558 
system function, 477 


T 
Taylor series, 278 
temporally coherent, 136 
test for equality of means of two 
populations, 408-412 
time-variant impulse response, 
474 
Toeplitz property, 717 
total probabilities, 20-35 
transfer function 
LSI system 
example, 593 
transformation of CDFs 
example of, 160 
transition probabilities, 564 
transition time, 565 
trapping state, 511 
Trellis diagram 
Markov chain, 507 
triangle inequality, 647 
triangular autocorrelation 
function, 589 
tri-diagonal correlation function
diagram, 461 
two-state random sequence with 
memory 
example, 466-467
two-variable-to-two-variable
matrixer, 194


U 
unbiased estimate, 761 
unbiasedness, 349 
estimator, 422, 425 
unconditional CDF, 110, 571 
unconditional probability, 34 
uncorrelated 
random processes, 578 
random variables 
properties of, 248-249 
random vector, 312 
samples, 325 
sequence, 449 
uncountable, 16, 218 
uncoupled two-channel LSI 
system, 607 
uniform law, 90 
uniform pdf, 95
uniform random number
generators (URNG), 281
union of sets (events), 10
unitary
matrices, 316
unit-step function, 551 
univariate normal pdf, 89, 322 
universal set, 10 
unrealizable estimator, 735-737 
upsampled (expansion), 497-498


V
variance and covariance, 355-357
confidence interval, 357-359
covariance, estimating,
360-361
standard deviation directly,
estimating, 359-360
Tables of, 246
variance-estimator function
(VEF), 348-349
variance function, 457
variance of normal population,
416-415
variation of parameters, 556
vector convolution
defined, 609
vector Markov random sequence,
512
example, 513
vector Markov random process
defined, 610
vector means and covariance
matrices, 376-377
μ, estimation of, 377-378
covariance K, estimation of,
378-380
vector parameters
linear estimation of, 380-383
vector processes, 607-612
vector random sequence, 511-513
Venn diagram, 11
axiomatic definition of
probability, 15-20
V = g(X, Y), W = h(X, Y)
problems of type, 193-200 
Viterbi algorithm, 508, 757-760 
example, 759-760 
Von Mises, Richard, 2, 4 


W
waiting times, 458 
example, 457-458 
weak law-nonuniform variance 
theorem, 521 
weak law of large numbers, 660 
theorem, 521 
weighted average, 216 
white Gaussian noise, 644 
white Gaussian random 
sequence, 513 
white Gaussian zero-mean 
sequence 
theorem, 723-724 
whitening, 317, 318 
transformation, 318 
white noise, 577, 590 
estimating a signal in, 711 
wide-sense cyclostationary 
random process 
defined, 602 
wide-sense Markov of order 2, 
723 
wide-sense Markov zero-mean 
sequence 
theorem, 722 
wide-sense periodic stationary, 
600 
wide-sense stationary (WSS), 
465-466, 547 
bandpass random process, 675 
covariance function 
example, 465 
cross-correlation matrices, 512 
defined for, 464 
periodic processes, 678-681
defined, 678
example, 679-681 
theorem, 679 
processes, 581-600 
derivative example, 584 
Fourier series, 681-683
PSK 
example, 604 
random process 
defined, 580 
m.s. periodic theorem, 679 
random sequences, 486-500
defined, 486 
input/output relations, 493 
zero-mean random sequence, 
761 
Wiener, Norbert, 560 
Wiener filters 
for random sequences, 734-738 
Wiener-Levy process, 561
Wiener process, 548, 560-564, 
591 
m.s. integral, 652 
Wishart distribution, 379 


Y 
Y = g(X) problems, 155-171 
general formula of 
determining, 166-167 


Z 
zero crossing 
information in, 557 
zero-input solution, 611 
zero-mean Gaussian RV, 237 
zero-mean random sequence 
example of, 495 
zero-order modified Bessel 
function, 193 
zero-state solution, 611 
Z = g(X, Y)
solving problems of type, 
155-171 
Z-transforms, 471, 477, 484 


